REMARKS



I. Comments on the Restriction Requirement 

Applicants note that Claims 10-11 (Group V), Claim 12 (Group VI), Claim 16 (Group X), and
Claim 17 (Group XI) are "method of use" claims drawn to methods of using the polynucleotides of
Group II, which should be examined together with the polynucleotide claims of Group II, per the
Commissioner's Notice in the Official Gazette of March 26, 1996, entitled "Guidance on Treatment of
Product and Process Claims in light of In re Ochiai, In re Brouwer, and 35 U.S.C. § 103(b)," which
sets forth the rules, upon allowance of product claims, for rejoinder of process claims covering the
same scope of products.

II. Priority Information

The Examiner requested that the priority information in the Specification be amended to reflect
that U.S. application serial number 09/309,320 had issued as a patent. (Office Action, page 4.) The
Specification has been amended accordingly.

III. Publications Cited in the Office Action

The Examiner cited Russell [J. Mol. Biol. 244:332-350], Skolnick et al. [Trends in Biotech.
18(1):34-39], and Attwood [Science 290:471-473, 29 October 2000] in support of the enablement
rejection under 35 U.S.C. § 112, first paragraph (Office Action, page 10). Applicants note that copies
of the Russell, Skolnick et al., and Attwood publications were neither listed on the PTO-892 form nor
included with the Office Action.



IV. Objections to Claims 2-4

The Examiner objected to Claims 2-4 because "Claim 2 is dependent on non-elected claim 1."
(Office Action, page 5.) Amended Claim 2 is an independent claim. Claims 3 and 4 depend from
Claim 2. Therefore, Applicants respectfully request that [...]
V. Rejection of Claims 2-4 and 8-9 Under 35 U.S.C. § 112, first paragraph, written
description

Claims 2-4 and 8-9 have been rejected under the first paragraph of 35 U.S.C. § 112 for alleged
lack of an adequate written description. The Examiner alleges that the polynucleotides of Claim 2
encoding polypeptide fragments, the polynucleotide variants of Claim 8, and the polynucleotides of
Claim 8 complementary to or ribonucleotide equivalents of SEQ ID NO:2 and SEQ ID NO:2 variants
are not adequately described.

Solely in order to expedite prosecution, Applicants have amended Claim 2 such that
polynucleotides encoding biologically active or immunogenic fragments of SEQ ID NO:1 are no longer
recited. Therefore the rejection as it pertains to polynucleotides encoding biologically active or
immunogenic fragments of SEQ ID NO:1 is moot.

The Examiner ignores the claim limitation of "at least 90% identical to a polynucleotide
sequence of SEQ ID NO:2" and attempts to introduce a limitation of "function" to the polynucleotide
variants, a limitation which is not present in the pending claims. The Examiner also ignores the
limitation that the claimed polynucleotides comprise a naturally occurring polynucleotide sequence.

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 112, first
paragraph, are well established by case law.

. . . the applicant must also convey with reasonable clarity to those skilled in
the art that, as of the filing date sought, he or she was in possession of the invention.
The invention is, for purposes of the "written description" inquiry, whatever is now
claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991)

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5, 2001, which
provide that:

An applicant may also show that an invention is complete by disclosure of sufficiently
detailed, relevant identifying characteristics which provide evidence that applicant was
in possession of the claimed invention, i.e., complete or partial structure, other
physical and/or chemical properties, functional characteristics when coupled with a
known or disclosed correlation between function and structure, or some combination of
such characteristics. What is conventional or well known to one of ordinary skill in the
art need not be disclosed in detail. If a skilled artisan would have understood the
inventor to be in possession of the claimed invention at the time of filing, even if every
nuance of the claims is not explicitly described in the specification, then the adequate
description requirement is met.

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 

SEQ ID NO:1 and SEQ ID NO:2 are specifically disclosed in the application (see, for
example, pages 50-52). Variants of SEQ ID NO:1 are described, for example, at page 6, lines 5-11.
In particular, the preferred, more preferred, and most preferred SEQ ID NO:1 variants (80%, 90%,
and 95% amino acid sequence similarity to SEQ ID NO:1) are described, for example, at page 12,
lines 23-26. Variants of SEQ ID NO:2 are described, for example, at page 12, line 27 through page
13, line 19. Incyte clones in which the nucleic acids encoding the human HGST were first identified and
libraries from which those clones were isolated are described, for example, at page 11, line 28 through
page 12, line 3 of the Specification. Chemical and structural features of HGST are described, for
example, on page 12, lines 4-20.

Given SEQ ID NO:1, one of ordinary skill in the art would recognize a polynucleotide encoding
a naturally-occurring variant of SEQ ID NO:1 having at least 90% sequence identity to SEQ ID NO:1.
Given SEQ ID NO:2, one of ordinary skill in the art would recognize a naturally-occurring variant of
SEQ ID NO:2 having at least 90% sequence identity to SEQ ID NO:2. The Specification describes
how to use BLAST to determine whether a given sequence falls within the "at least 90% identical"
scope. (Specification, page 41, line 28 through page 42, line 13.)
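
By way of illustration only, the "at least 90% identical ... over the entire length" test can be sketched as a simple calculation on a pairwise alignment. The sketch below is not the BLAST procedure described in the Specification; it assumes the two sequences have already been aligned (gaps written as "-"), and the function names and toy sequences are hypothetical.

```python
def percent_identity_over_reference(aligned_ref: str, aligned_query: str) -> float:
    """Percent of reference residues matched identically by the query.

    Both inputs are rows of one pairwise alignment (same length, '-' for gaps).
    The denominator is the full ungapped reference length, mirroring the
    "over the entire length of SEQ ID NO" claim language (an assumption about
    how that language maps onto an alignment).
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    ref_length = sum(1 for r in aligned_ref if r != "-")
    if ref_length == 0:
        raise ValueError("reference sequence is empty")
    matches = sum(
        1
        for r, q in zip(aligned_ref.upper(), aligned_query.upper())
        if r != "-" and r == q
    )
    return 100.0 * matches / ref_length


def within_claimed_scope(aligned_ref: str, aligned_query: str,
                         threshold: float = 90.0) -> bool:
    """True when the query meets the identity threshold against the reference."""
    return percent_identity_over_reference(aligned_ref, aligned_query) >= threshold


# Toy 10-residue reference (hypothetical, not SEQ ID NO:2).
reference = "ACGTACGTAC"
one_mismatch = "ACGTACGTAA"   # 9 of 10 reference positions match
two_gaps = "ACGTACGT--"       # 8 of 10 reference positions match
print(within_claimed_scope(reference, one_mismatch))
print(within_claimed_scope(reference, two_gaps))
```

Dividing by the full reference length, rather than by the aligned region only, is the design choice that keeps a short but perfect local match from satisfying the threshold.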

There simply is no requirement that the claims recite particular variant polypeptide, variant
polynucleotide, complementary polynucleotide, or ribonucleotide equivalent polynucleotide sequences
because the claims already provide sufficient structural definition of the claimed subject matter. That is,
[...] occurring amino acid sequence having at least 90% sequence identity to the sequence of SEQ ID NO:1
over the entire length of SEQ ID NO:1"). The polynucleotide variants, complementary polynucleotides,
and ribonucleotide equivalent polynucleotides are defined in terms of SEQ ID NO:2 ("An isolated
polynucleotide comprising a sequence selected from the group consisting of a) a polynucleotide
sequence of SEQ ID NO:2, b) a naturally-occurring polynucleotide sequence having at least 90%
sequence identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2, c) a
polynucleotide sequence completely complementary to a), d) a polynucleotide sequence completely
complementary to b) and e) a ribonucleotide equivalent of a)-d)").

Because the recited polypeptide variants are defined in terms of SEQ ID NO:1, and the recited
polynucleotide variants, complementary polynucleotides, and ribonucleotide equivalent polynucleotides
are defined in terms of SEQ ID NO:2, the precise chemical structure of every polypeptide variant,
every polynucleotide variant, every complementary polynucleotide, and every ribonucleotide equivalent
polynucleotide within the scope of the claims can be discerned. The Examiner's position is nothing
more than a misguided attempt to require Applicants to unduly limit the scope of their claimed invention.
Accordingly, the Specification provides an adequate written description of the recited polypeptide and
polynucleotide sequences.
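
As a purely illustrative sketch of why a complementary or ribonucleotide equivalent sequence is structurally fixed once the reference sequence is known, the short program below derives both from a DNA string. It assumes that "completely complementary" means the base-paired strand (written here as the reverse complement) and that "ribonucleotide equivalent" means the same sequence with uracil in place of thymine; the function names and the toy sequence are hypothetical and are not taken from the Specification.

```python
# Watson-Crick pairing table for DNA bases.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the fully complementary strand, written 5' to 3'."""
    dna = dna.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return dna.translate(_DNA_COMPLEMENT)[::-1]


def ribonucleotide_equivalent(dna: str) -> str:
    """Return the RNA form of a DNA sequence (T replaced by U)."""
    dna = dna.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return dna.replace("T", "U")


toy_sequence = "ATGC"  # hypothetical stand-in, not SEQ ID NO:2
print(reverse_complement(toy_sequence))
print(ribonucleotide_equivalent(toy_sequence))
```

Each output is determined entirely by the input sequence, which is the sense in which these derived polynucleotides are defined by structure rather than by function.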

A. The present claims specifically define the claimed genus through the recitation 
of chemical structure 

Court cases in which "DNA claims" have been at issue commonly emphasize that the recitation
of structural features or chemical or physical properties is an important factor to consider in a written
description analysis of such claims. For example, in Fiers v. Revel, 25 USPQ2d 1601, 1606 (Fed.
Cir. 1993), the court stated that:

If a conception of a DNA requires a precise definition, such as by structure, formula,
chemical name or physical properties, as we have held, then a description also requires
that degree of specificity.

In a number of instances in which claims [...]
any reference to structural features. As set forth by the court in University of California v. Eli Lilly
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997):

In claims to genetic material, however, a generic statement such as "vertebrate insulin
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written
description of the genus because it does not distinguish the claimed genus from others,
except by function.

Thus, the mere recitation of functional characteristics of a DNA, without the definition of
structural features, has been a common basis by which courts have found invalid claims to DNA. For
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description
requirement the following claim of U.S. Patent No. 4,652,525:



1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide 
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count: 

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel 
priority application lacked an adequate written description of the subject matter of the count.

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics
and were found not to comply with the written description requirement of 35 U.S.C. § 112, i.e., "an
mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human
fibroblast interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the
claims at issue in the present application define polynucleotides in terms of chemical structure, rather
than on functional characteristics. For example, [...]
2. An isolated polynucleotide encoding a polypeptide
comprising an amino acid sequence selected from the group consisting of . . .

b) a naturally-occurring amino acid sequence having at least 90%
sequence identity to the sequence of SEQ ID NO:1 over the entire length of
SEQ ID NO:1.

8. An isolated polynucleotide comprising a sequence selected from the
group consisting of . . .

b) a naturally-occurring polynucleotide sequence having at least 90%
sequence identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID
NO:2 . . .

From the above it should be apparent that the claims of the subject application are
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present
claims is defined in terms of the chemical structure of SEQ ID NO:1 and SEQ ID NO:2. In the present
case, there is no reliance merely on a description of functional characteristics of the polynucleotides
recited by the claims. In fact, there is no recitation of functional characteristics. Moreover, if such
functional recitations were included, they would add to the structural characterization of the recited
polynucleotides. The polynucleotides defined in the claims of the present application recite structural
features, and cases such as Lilly and Fiers stress that the recitation of structure is an important factor to
consider in a written description analysis of claims of this type. By failing to base its written description
inquiry "on whatever is now claimed," the Office Action failed to provide an appropriate analysis of the
present claims and how they differ from those found not to satisfy the written description requirement in
Lilly and Fiers.

B. The present claims do not define a genus which is highly diverse

Furthermore, the claims at issue do not describe a genus which could be characterized as highly
diverse. Available evidence illustrates that the claimed genus is of narrow scope.

In support of this assertion, the Examiner's attention is directed to the enclosed reference by
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant
[...] and with <90% overall sequence identity, Brenner et al. have determined that 30% identity is a reliable
threshold for establishing evolutionary homology between two sequences aligned over at least 150
residues. (Brenner et al., pages 6073 and 6076.) Furthermore, local identity is particularly important in
this case for assessing the significance of the alignments, as Brenner et al. further report that 40%
identity over at least 70 residues is reliable in signifying homology between proteins. (Brenner et al.,
page 6076.)

The present application is directed, inter alia, to glutathione s-transferase proteins related to
the amino acid sequence of SEQ ID NO:1. In accordance with Brenner et al., naturally occurring
molecules may exist which could be characterized as glutathione s-transferase proteins and which have
as little as 40% identity over at least 70 residues to SEQ ID NO:1. The "variant language" of the
present claims recites, for example, polynucleotides encoding a polypeptide comprising "a naturally-
occurring amino acid sequence having at least 90% sequence identity to the sequence of SEQ ID NO:1
over the entire length of SEQ ID NO:1." This variation is far less than that of all potential glutathione s-
transferase proteins related to SEQ ID NO:1, i.e., those glutathione s-transferase proteins having as
little as 40% identity over at least 70 residues to SEQ ID NO:1.
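
The difference in breadth between the two standards can be illustrated with a short, hypothetical sketch. It encodes only the thresholds stated in this section (30% identity over at least 150 residues, or 40% over at least 70 residues, for homology per Brenner et al.; at least 90% identity over the entire reference length for the claims). The function names are illustrative, the 222-residue length and 57% identity figures are those the Specification reports for HGST and the two human Alpha GSTs, and the assumption that the 57% figure applies across all 222 residues is made here only for the example.

```python
def meets_brenner_threshold(identity_pct: float, aligned_residues: int) -> bool:
    """Homology thresholds attributed in the text to Brenner et al."""
    return (identity_pct >= 30.0 and aligned_residues >= 150) or (
        identity_pct >= 40.0 and aligned_residues >= 70
    )


def meets_claim_threshold(identity_pct: float, aligned_residues: int,
                          reference_length: int) -> bool:
    """Claim language: at least 90% identity over the entire reference length."""
    return identity_pct >= 90.0 and aligned_residues >= reference_length


HGST_LENGTH = 222        # residues in SEQ ID NO:1, per the Specification
ALPHA_GST_IDENTITY = 57  # percent identity reported for the human Alpha GSTs

# Assumption for illustration: the 57% figure holds over all 222 residues.
print(meets_brenner_threshold(ALPHA_GST_IDENTITY, HGST_LENGTH))
print(meets_claim_threshold(ALPHA_GST_IDENTITY, HGST_LENGTH, HGST_LENGTH))
```

Under these assumptions a sequence can qualify as a homolog under the Brenner et al. thresholds while falling well outside the 90% variant language, which is the sense in which the claimed genus is the narrower one.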

C. The state of the art at the time of the present invention is further advanced
than at the time of the Lilly and Fiers applications

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply
with the written description requirement of 35 U.S.C. § 112. The '525 patent claimed the benefit of
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial
No. 805,023 filed June 9 [...]. In the Fiers case, party Revel claimed the benefit of priority of an
Israeli application filed on November 21 [...]. Thus, the written description inquiry in those cases was
based on the state of the art at essentially the "dark ages" of recombinant DNA technology.

The present application has a priority date of November 26, 1996. Much has happened in the
development of recombinant DNA technology in the 17 or more years from the time of filing of the
[...] technology has been developed. Large databases of protein and nucleotide sequences have been
compiled. Much of the raw material of the human and other genomes has been sequenced. With these
remarkable advances one of skill in the art would recognize that, given the sequence information of
SEQ ID NO:1 and SEQ ID NO:2, and the additional extensive detail provided by the subject
application, the present inventors were in possession of the recited polypeptide variants, polynucleotide
variants, complementary polynucleotides, and ribonucleotide equivalent polynucleotides at the time of
filing of this application.

D. Summary

The Office Action failed to base its written description inquiry "on whatever is now claimed."
Consequently, the Action did not provide an appropriate analysis of the present claims and how they
differ from those found not to satisfy the written description requirement in cases such as Lilly and
Fiers. In particular, the claims of the subject application are fundamentally different from those found
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical
structure of SEQ ID NO:1 or SEQ ID NO:2. The courts have stressed that structural features are
important factors to consider in a written description analysis of claims to nucleic acids and proteins. In
addition, the genus of polynucleotides defined by the present claims is adequately described, as
evidenced by Brenner et al. Furthermore, there have been remarkable advances in the state of the art
since the Lilly and Fiers cases, and these advances were given no consideration whatsoever in the
position set forth by the Office Action.

VI. Rejection of Claims 2-4 and 8-9 Under 35 U.S.C. § 112, first paragraph, enablement

Applicants' invention is directed, inter alia, to polynucleotides encoding polypeptides (HGST)
having strong homology to two human Alpha GSTs, pGTH2 (GI 825005; SEQ ID NO:3) and A1-1
(GI 259141; SEQ ID NO:4) and a mouse Alpha GSH, GST 5.7 (GI 193710; SEQ ID NO:5). These
polynucleotides have a variety of utilities, in particular in expression profiling, and in particular for
drug discovery (see the Specification at, e.g., page 34, line 18 through page 37, line 11). As described
in the Specification (page 11, line 28 through page 12, line 22):

Nucleic acids encoding the human HGST of the present invention were first
identified in Incyte Clone 1553079 from the bladder tumor cDNA library
(BLADTUT04) through a computer-generated search for amino acid sequence
alignments. A consensus sequence, SEQ ID NO:2, was derived from the following
overlapping and/or extended nucleic acid sequences: Incyte Clones 1553079/
BLADTUT04, 1328546/PANCNOT07, 1422059/KIDNNOT09, and 2188683/
PROSNOT26.

In one embodiment, the invention encompasses the novel human glutathione s-
transferase, a polypeptide comprising the amino acid sequence of SEQ ID NO:1, as
shown in Figures 1A, 1B, and 1C. HGST is 222 amino acids in length and has
chemical and structural homology with two human Alpha GSTs, pGTH2 (GI 825005;
SEQ ID NO:3), and A1-1 (GI 259141), and a mouse Alpha GSH, GST 5.7 (GI
193710). In particular, HGST shares 57% overall identity with each of the two human
GSTs and 59% identity with the mouse GST. In addition, various amino acid residues
found to be essential for the catalytic activity and substrate binding of GSTs are
conserved in HGST and in the other three GST molecules. These residues are: Y9,
R13, R20, E32, Q67, T68, R69, E97, D101, E104, and R131. Only residues E97
and E104 are not found in the mouse GST. E32 and E97 form salt bridges with R20
and R69, respectively, and these salt bridges or residues are thought to be important in
structural stability of the GST molecule and may be important for catalysis. Y9 is
essential for catalysis by facilitating ionization of GSH. Residues Q67, T68, D101,
E104, and R131 are important for the binding of GSH. As illustrated by Figures 3, 4,
5, and 6, HGST and the three Alpha GSTs have rather similar hydrophobicity plots.
Figures 7, 8, and 9 show the isoelectric point analyses for HGST, pGTH2, and A1-1.
The pI values of 8.8, 9.0, and 9.3, respectively, fall within the range characteristic of
Alpha GSTs. In addition to bladder tumor, partial transcripts of the cDNA encoding
HGST are found in fetal tissues (kidney and pancreas) and in prostate tissue adjacent to
prostate cancer.

Claims 2-4 and 8-9 stand rejected under 35 U.S.C. § 112, first paragraph, based on the
allegation that the claimed invention is not adequately enabled. The rejection alleges in particular,
regarding the claimed polynucleotide variants of SEQ ID NO:2, "absent factual evidence, a percentage
sequence similarity of less than 100% is not deemed to reasonably support, to one skilled in the art, as
to whether the biochemical activity of the claimed subject matter would be the same as that of such a
similar known biomolecule." (Office Action, page 9.) The Examiner further alleged that the claims
contain "subject matter which was not described in the specification in such a way as to enable one of
skill in the art to which it pertains, or with which it is most nearly connected, to make and/or use the
invention" and that "a skilled artisan would be forced into undue experimentation to practice (i.e., make
and use) the invention as is broadly claimed." (Office Action, pages 9-10.)

The Examiner further alleged that "[t]he need for non-routine experimentation demonstrates that
the specification is not enabled for any asserted use or well-established use for bacterial membrane
polypeptides." (Office Action, page 10.)

The claimed polynucleotides are enabled, i.e.. they are supported by the Specification and what 
is well known in the art. 

A. How to make 

SEQ ID NO:1 and SEQ ID NO:2 are specifically disclosed in the application (see, for
example, pages 50-52 of the Sequence Listing). Variants of SEQ ID NO:1 are disclosed, for example,
at page 6, lines 6-14. In particular, the preferred, more preferred, and most preferred SEQ ID NO:1
variants (80%, 90%, and 95% amino acid sequence similarity to SEQ ID NO:1) are disclosed, for
example, at page 12, lines 23-26. Variants of SEQ ID NO:2 are disclosed, for example, at page 12,
line 27 through page 13, line 19 and page 13, line 26 through page 14, line 21. Incyte clones in which
the nucleic acids encoding the human HGST were first identified and libraries from which those clones
were isolated are disclosed, for example, at page 11, line 28 through page 12, line 3 of the
Specification. Chemical and structural features of HGST are disclosed, for example, on page 12, lines
4-20.

Applicants note that the specification and claims do not recite "bacterial membrane [...]
The Examiner alleged that the claims were not enabled because "[i]n absence of further
guidance from Applicants, the skilled artisan would have to de novo discover what the appropriate
conservative amino acid substitutions are and what critical regions can and cannot tolerate such
substitutions" and "absent factual evidence, a percentage sequence similarity of less than 100% is not
deemed to reasonably support, to one skilled in the art, as to whether the biological activity of the
claimed subject matter would be the same as that of such a similar known biomolecule." (Office
Action, pages 9 and 10.) However, Applicants submit that the polypeptide variant sequences and
polynucleotide variant sequences are defined by their being "naturally occurring" and by their
percentage sequence identity with SEQ ID NO:1 and SEQ ID NO:2 and not by biological function.
The choice of amino acids or nucleotides to alter is made by nature. "Naturally occurring" polypeptide
variant sequences and polynucleotide variant sequences occur in nature; they are not created
exclusively in a laboratory. The Specification teaches how to find polynucleotide variants (e.g., page
35, lines 3-15) which can then be expressed to make polypeptide variants and how to use BLAST
methods to determine whether a given naturally occurring polynucleotide sequence falls within the "at
least 90% identical to a polynucleotide sequence of SEQ ID NO:2" scope and whether a given
naturally occurring amino acid sequence falls within the "at least 90% identical to an amino acid
sequence of SEQ ID NO:1" scope (e.g., page 41, line 28 through page 42, line 13). In addition,
determination of percent identity is well known in the art.

The making of the claimed polynucleotides by recombinant and chemical synthetic methods is
disclosed in the Specification at, e.g., page 13, lines 20-25, page 16, lines 10-21, and page 17, lines
13-15.

This satisfies the "how to make" requirement of 35 U.S.C. § 112, first paragraph.

B. How to Use

The rejection of Claims 2-4 and 8-9 is improper, as the inventions of those claims are [...]
The invention at issue is a polynucleotide sequence corresponding to a gene that is expressed in
human bladder tumor tissue, as well as polynucleotides encoding SEQ ID NO:1 variants, variants of the
SEQ ID NO:2 polynucleotide, complementary polynucleotides, and ribonucleotide equivalent
polynucleotides (hereinafter "the claimed polynucleotides"). The SEQ ID NO:2 polynucleotide codes
for a polypeptide demonstrated in the patent specification to be a member of the class of glutathione s-
transferases, whose biological functions include deactivation and detoxification of potentially mutagenic
and carcinogenic chemicals. (Specification, pages 1-3.) The claimed invention has numerous practical,
beneficial uses in toxicology testing, drug development, and the diagnosis of disease, none of which
requires knowledge of how the polypeptides coded for by the claimed polynucleotides actually
function. As a result of the benefits of these uses, the claimed invention already enjoys significant
commercial success.

Applicants submit with this Response the Declaration of Dr. Tod Bedilion describing some of
the practical uses of the claimed invention in gene and protein expression monitoring applications. The
Bedilion Declaration demonstrates that the positions and arguments made by the Patent Examiner with
respect to the enablement and utility of the claimed polynucleotides are without merit.

The Bedilion Declaration describes, in particular, how the claimed expressed polynucleotides
can be used in gene expression monitoring applications that were well-known at the time the patent
application was filed, and how those applications are useful in developing drugs and monitoring their
activity. Dr. Bedilion states that the claimed invention is a useful tool when employed as a highly
specific probe in a cDNA microarray:

Persons skilled in the art would [have appreciated on November 26, 1996] that cDNA
microarrays that contained the claimed polynucleotides would be a more useful tool than cDNA
microarrays that did not contain the polynucleotides in connection with conducting gene
expression monitoring studies on proposed (or actual) drugs for treating cancer for such
purposes as evaluating their efficacy and toxicity. (Bedilion Declaration, ¶ 18.)
The Patent Examiner contends that the claimed polynucleotides cannot be useful without
precise knowledge of their biological function. But the law never has required knowledge of biological
function to prove utility. It is the claimed invention's uses, not its functions, that are the subject of a
proper analysis under the utility requirement.

In any event, as demonstrated by the Bedilion Declaration, the person of ordinary skill in the art
can achieve beneficial results from the claimed polynucleotides in the absence of any knowledge as to
the precise function of the proteins encoded by them. The uses of the claimed polynucleotides in gene
expression monitoring applications are in fact independent of their precise function.

1. The Applicable Legal Standard 

To meet the utility requirement of sections 101 and 112 of the Patent Act, the patent applicant
need only show that the claimed invention is "practically useful," Anderson v. Natta, 480 F.2d 1392,
1397, 178 USPQ 458 (CCPA 1973) and confers a "specific benefit" on the public. Brenner v.
Manson, 383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a recent Court of Appeals
for the Federal Circuit case, this threshold is not high:

An invention is "useful" under section 101 if it is capable of providing some identifiable
benefit. See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966);
Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24
USPQ2d 1401] (Fed. Cir. 1992) ("to violate Section 101 the claimed device must be
totally incapable of achieving a useful result"); Fuller v. Berger, 120 F. 274, 275 (7th
Cir. 1903) (test for utility is whether invention "is incapable of serving any beneficial
end").

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999)

While an asserted utility must be described with specificity, the patent applicant need not
demonstrate utility to a certainty. In Stiftung v. Renishaw PLC, 945 F.2d 1173, 1180, 20 USPQ2d
1094 (Fed. Cir. 1991), the United States Court of Appeals for the Federal Circuit explained:

An invention need not be the best or only way to accomplish a certain result, and it
need only be useful to some extent and in certain applications: "[T]he fact that an
invention has only limited utility and is only operable in certain applications is not [...]
The specificity requirement is not, therefore, an onerous one. If the asserted utility is described
so that a person of ordinary skill in the art would understand how to use the claimed invention, it is
sufficiently specific. See Standard Oil Co. v. Montedison, S.p.A., 212 U.S.P.Q. 327, 343 (3d Cir.
1981). The specificity requirement is met unless the asserted utility amounts to a "nebulous expression"
such as "biological activity" or "biological properties" that does not convey meaningful information
about the utility of what is being claimed. Cross v. Iizuka, 753 F.2d 1040, 1048 (Fed. Cir. 1985).

In addition to conferring a specific benefit on the public, the benefit must also be "substantial."
Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" utility. Nelson v. Bowler,
626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980).

If persons of ordinary skill in the art would understand that there is a "well-established" utility
for the claimed invention, the threshold is met automatically and the applicant need not make any
showing to demonstrate utility. Manual of Patent Examining Procedure at § 706.03(a). Only if there
is no "well-established" utility for the claimed invention must the applicant demonstrate the practical
benefits of the invention. Id.

Once the patent applicant identifies a specific utility, the claimed invention is presumed to
possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re Brana,
51 F.3d 1560, 1566, 34 USPQ2d 1436 (Fed. Cir. 1995). In that case, the Patent Office bears the
burden of demonstrating that a person of ordinary skill in the art would reasonably doubt that the
asserted utility could be achieved by the claimed invention. Id. To do so, the Patent Office must
provide evidence or sound scientific reasoning. See In re Langer, 503 F.2d 1380, 1391-92, 183
USPQ 288 (CCPA 1974). If and only if the Patent Office makes such a showing, the burden shifts to
the applicant to provide rebuttal evidence that would convince the person of ordinary skill that there is
sufficient proof of utility. Brana, 51 F.3d at 1566. The applicant need only prove a "substantial
likelihood" of utility; certainty is not required. Brenner, 383 U.S. at 532.

2. Uses of the claimed polynucleotides for diagnosis of conditions and
disorders characterized by expression of HGST, for toxicology testing,

The claimed invention meets all of the necessary requirements for establishing a credible utility
under the Patent Law: There are "well-established" uses for the claimed invention known to persons of
ordinary skill in the art, and there are specific practical and beneficial uses for the invention disclosed in
the patent application's specification. These uses are explained, in detail, in the Bedilion Declaration
accompanying this Response. Objective evidence, not considered by the Patent Office, further
corroborates the credibility of the asserted utilities.

a. The uses of the claimed polynucleotides for toxicology testing, 
drug discovery, and disease diagnosis are practical uses that 
confer "specific benefits" to the public 

The claimed invention has specific, substantial, real-world utility by virtue of its use in toxicology
testing, drug development and disease diagnosis through gene expression profiling. These uses are
explained in detail in the accompanying Bedilion Declaration. The claimed invention is a useful tool in
cDNA microarrays used to perform gene expression analysis. That is sufficient to establish utility for
the claimed polynucleotides.

In his Declaration, Dr. Bedilion explains the many reasons why a person skilled in the art
reading the Goli '771 application on November 26, 1996 would have understood that application to
disclose the claimed polynucleotides to be useful for a number of gene expression monitoring
applications, as highly specific probes for the expression of those specific polynucleotides in
connection with the development of drugs and the monitoring of the activity of such drugs. (Bedilion
Declaration at, e.g., ¶¶ 10-15). Much, but not all, of Dr. Bedilion's explanation concerns the use of the
claimed polynucleotides in cDNA microarrays of the type first developed at Stanford University for
evaluating the efficacy and toxicity of drugs, as well as for other applications. (Bedilion Declaration, ¶¶
12 and 15).



Dr. Bedilion also explained, for example, why persons skilled in the art would also appreciate,

In connection with his explanations, Dr. Bedilion states that the "Goli '771 application would
have led a person skilled in the art on November 26, 1996 who was using gene expression monitoring
in connection with working on developing new drugs for the treatment of cancer to conclude that a
cDNA microarray that contained the claimed polynucleotides would be a highly useful tool and to
request specifically that any cDNA microarray that was being used for such purposes contain the
claimed polynucleotides." (Bedilion Declaration, ¶ 15). For example, as explained by Dr. Bedilion,
"[p]ersons skilled in the art would [have appreciated on November 26, 1996] that cDNA microarrays
that contained the claimed polynucleotides would be a more useful tool than cDNA microarrays that
did not contain the polynucleotides in connection with conducting gene expression monitoring studies on
proposed (or actual) drugs for treating cancer for such purposes as evaluating their efficacy and
toxicity." Id.

In support of those statements, Dr. Bedilion provided detailed explanations of how cDNA
technology can be used to conduct gene expression monitoring evaluations, with extensive citations to
pre-November 26, 1996 publications showing the state of the art on November 26, 1996. (Bedilion
Declaration, ¶¶ 10-14). While Dr. Bedilion's explanations in paragraph 15 of his Declaration include
three pages of text and six subparts (a)-(f), he specifically states that his explanations are not "all-
inclusive." Id. For example, with respect to toxicity evaluations, Dr. Bedilion had earlier explained
how persons skilled in the art who were working on drug development on November 26, 1996 (and
for several years prior to November 26, 1996) "without any doubt" appreciated that the toxicity (or
lack of toxicity) of any proposed drug was "one of the most important criteria to be evaluated in
connection with the development of the drug" and how the teachings of the Goli '771 application clearly
include using differential gene expression analyses in toxicity studies (Bedilion Declaration, ¶ 10).

Thus, the Bedilion Declaration establishes that persons skilled in the art reading the Goli '771
application at the time it was filed "would have wanted their cDNA microarray to have a [claimed
polynucleotide probe] because a microarray that contained such a probe (as compared to one that did
not) would provide more useful results in the kind of gene expression monitoring studies using cDNA

conclusion that the Goli '771 application disclosed to persons skilled in the art at the time of its filing
substantial, specific and credible real-world utilities for the claimed polynucleotides.

Nowhere does the Patent Examiner address the fact that, as described on pages 22 and 35 of
the Goli '739 application, the claimed polynucleotides can be used as highly specific probes in, for
example, chip based technologies [cDNA microarrays] - probes that without question can be used to
measure both the existence and amount of complementary RNA sequences known to be the expression
products of the claimed polynucleotides. The claimed invention is not, in that regard, some random
sequence whose value as a probe is speculative or would require further research to determine.

Given the fact that the claimed polynucleotides are known to be expressed, their utility as a
measuring and analyzing instrument for expression levels is as indisputable as a scale's utility for
measuring weight. This use as a measuring tool, regardless of how the expression level data ultimately
would be used by a person of ordinary skill in the art, by itself demonstrates that the claimed invention
provides an identifiable, real-world benefit that meets the utility requirement. Raytheon v. Roper, 724
F.2d 951 (Fed. Cir. 1983) (claimed invention need only meet one of its stated objectives to be useful);
In re Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999) (how the invention works is irrelevant to
utility); MPEP § 2107 ("Many research tools such as gas chromatographs, screening assays, and
nucleotide sequencing techniques have a clear, specific, and unquestionable utility (e.g., they are useful
in analyzing compounds)" (emphasis added)).

The Bedilion Declaration shows that a number of pre-November 26, 1996 publications confirm
and further establish the utility of cDNA microarrays in a wide range of drug development gene
expression monitoring applications at the time the Goli '771 application was filed (Bedilion Declaration,
¶¶ 10-14; Bedilion Exhibits A-G). Indeed, Brown and Shalon U.S. Patent No. 5,807,522 (the Brown
'522 patent, Bedilion Exhibit D), which issued from a patent application filed in June 1995 and was
effectively published on December 29, 1995 as a result of the publication of a PCT counterpart
application, shows that the Patent Office recognizes the patentable utility of the cDNA technology
developed in the early to mid 1990s. As explained by Dr. Bedilion, among other things (Bedilion

The Brown '522 patent further teaches that the "[m]icroarrays of immobilized
nucleic acid sequences prepared in accordance with the invention" can be used in
"numerous" genetic applications, including "monitoring of gene expression" applications
(see Bedilion Tab D at col. 14, lines 36-42). The Brown '522 patent teaches (a)
monitoring gene expression (i) in different tissue types, (ii) in different disease states,
and (iii) in response to different drugs, and (b) that arrays disclosed therein may be used
in toxicology studies (see Bedilion Tab D at col. 15, lines 13-18 and 52-58 and col.
18, lines 25-30).

Literature reviews published after the filing of the Goli '771 application describing the state of
the art further confirm the claimed invention's utility. Rockett et al. confirm, for example, that the
claimed invention is useful for differential expression analysis regardless of how expression is regulated:

Despite the development of multiple technological advances which have recently
brought the field of gene expression profiling to the forefront of molecular analysis,
recognition of the importance of differential gene expression and characterization of
differentially expressed genes has existed for many years.

Although differential expression technologies are applicable to a broad range of models,
perhaps their most important advantage is that, in most cases, absolutely no prior
knowledge of the specific genes which are up- or down-regulated is required.

Whereas it would be informative to know the identity and functionality of all genes
up/down regulated by . . . toxicants, this would appear a longer term goal ....
However, the current use of gene profiling yields a pattern of gene changes for a
xenobiotic of unknown toxicity which may be matched to that of well characterized
toxins, thus alerting the toxicologist to possible in vivo similarities between the unknown
and the standard, thereby providing a platform for more extensive toxicological
examination. (emphasis in original.)

John C. Rockett, et al., Differential gene expression in drug metabolism and toxicology: practicalities,
problems, and potential, Xenobiotica 29:655-691 (July 1999) (Reference No. 2).

In another post-November 26, 1996 article, Lashkari et al. state explicitly that sequences that
are merely "predicted" to be expressed (predicted Open Reading Frames, or ORFs) - the claimed
invention in fact is known to be expressed - have numerous uses:

Efforts have been directed toward the amplification of each predicted ORF or any
other region of the genome ranging from a few base pairs to several kilobase pairs.
There are many uses for these amplicons - they can be cloned into standard vectors or
specialized expression vectors, or can be cloned into other specialized vectors such as
those used for two-hybrid analysis. The amplicons can also be used directly by, for
example, arraying onto glass for expression analysis, for DNA binding assays, or for
any direct DNA assay.

Lashkari, et al., Whole genome analysis: Experimental access to all genome sequenced segments
through larger-scale efficient oligonucleotide synthesis and PCR, (August 1997) Proc. Nat. Acad. Sci.
U.S.A. 94:8945-8947 (Reference No. 3). (emphasis added.)

b. The use of nucleic acids coding for proteins expressed by
humans as tools for toxicology testing, drug discovery, and the
diagnosis of disease is now "well-established"

The technologies made possible by expression profiling and the DNA tools upon which they 
rely are now well-established. The technical literature recognizes not only the prevalence of these 
technologies, but also their unprecedented advantages in drug development, testing and safety 
assessment. These technologies include toxicology testing, as described by Bedilion in his Declaration. 

Toxicology testing is now standard practice in the pharmaceutical industry. See, e.g., John C.
Rockett et al. (Reference No. 2, supra):

Knowledge of toxin-dependent regulation in target tissues is not solely an academic
pursuit as much interest has been generated in the pharmaceutical industry to harness
this technology in the early identification of toxic drug candidates, thereby shortening the
developmental process and contributing substantially to the safety assessment of new
drugs. (Reference No. 2, page 656)

To the same effect are several other scientific publications, including Emile F. Nuwaysir, et al.,
Microarrays and Toxicology: The Advent of Toxicogenomics, Molecular Carcinogenesis 24:153-159
(1999) (Reference No. 4); Sandra Steiner and N. Leigh Anderson, Expression profiling in toxicology
-- potentials and limitations, Toxicology Letters 112-13:467-471 (2000) (Reference No. 5).

Nucleic acids useful for measuring the expression of whole classes of genes are routinely
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human ToxChip
comprising 2089 human clones, which were selected

for their well-documented involvement in basic cellular processes as well as their
responses to different types of toxic insult. Included on this list are DNA replication
and repair genes, apoptosis genes, and genes responsive to PAHs and dioxin-like
compounds, peroxisome proliferators, estrogenic compounds, and oxidant stress.
Some of the other categories of genes include transcription factors, oncogenes, tumor
suppressor genes, cyclins, kinases, phosphatases, cell adhesion and motility genes, and
homeobox genes. Also included in this group are 84 housekeeping genes, whose
hybridization intensity is averaged and used for signal normalization of the other genes
on the chip. (Reference No. 4, page 156.)

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special interest
in making a human toxicology microarray).

The more genes that are available for use in toxicology testing, the more powerful the technique.
"Arrays are at their most powerful when they contain the entire genome of the species they are being
used to study." John C. Rockett and David J. Dix, Application of DNA Arrays to Toxicology,
Environ. Health Perspec. 107:681-685 (1999) (Reference No. 6, see page 683). Control genes are
carefully selected for their stability across a large set of array experiments in order to best study the
effect of toxicological compounds. See attached email from the primary investigator on the Nuwaysir
paper, Dr. Cynthia Afshari, to an Incyte employee, dated July 3, 2000, as well as the original message
to which she was responding (Reference No. 7), indicating that even the expression of carefully
selected control genes can be altered. Thus, there is no expressed gene which is irrelevant to screening
for toxicological effects, and all expressed genes have a utility for toxicological screening.

In fact, the potential benefit to the public, in terms of lives saved and reduced health care costs,
are enormous. Recent developments provide evidence that the benefits of this information are already

• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene
expression technology, information about the structure of a known transporter gene,
and chromosomal mapping location, to identify the key gene associated with Tangier
disease. This discovery took place over a matter of only a few weeks, due to the
power of these new genomics technologies. The discovery received an award from the
American Heart Association as one of the top 10 discoveries associated with heart
disease research in 1999.

• In an April 9, 2000, article published by the Bloomberg news service, an Incyte
customer stated that it had reduced the time associated with target discovery and
validation from 36 months to 18 months, through use of Incyte's genomic information
database. Other Incyte customers have privately reported similar experiences. The
implications of this significant saving of time and expense for the number of drugs that
may be developed and their cost are obvious.

• In a February 10, 2000, article in the Wall Street Journal, one Incyte customer stated
that over 50 percent of the drug targets in its current pipeline were derived from the
Incyte database. Other Incyte customers have privately reported similar experiences.
By doubling the number of targets available to pharmaceutical researchers, Incyte
genomic information has demonstrably accelerated the development of new drugs.

Because the Patent Examiner failed to address or consider the "well-established" utilities for the
claimed invention in toxicology testing, drug development, and the diagnosis of disease, the Examiner's
rejections should be withdrawn regardless of their merit.

c. The similarity of the polypeptides encoded by the claimed 
invention to another polypeptide of undisputed utility 
demonstrates utility 

In addition to having substantial, specific and credible utilities in numerous gene expression
monitoring applications, the utility of the claimed polynucleotide variants can be imputed based on the
relationship between the polypeptides they encode and another polypeptide of unquestioned utility, the
SEQ ID NO:1 polypeptide. The polypeptides have sufficient similarities in their sequences that a
person of ordinary skill in the art would recognize more than a reasonable probability that the

It is undisputed, and readily apparent from the patent application, that polypeptides encoded by
the claimed polynucleotides share more than 90% sequence identity over 222 amino acid residues with
the SEQ ID NO:1 polypeptide. This is more than enough homology to demonstrate a reasonable
probability that the utility of the SEQ ID NO:1 polypeptide can be imputed to the claimed
polynucleotides (through the polypeptides they encode). It is well-known that the probability that two
unrelated polypeptides share more than 40% sequence homology over 70 amino acid residues is
exceedingly small. (Brenner et al., supra, Reference No. 1). Given homology in excess of 40% over
many more than 70 amino acid residues, the probability that the polypeptide encoded for by the
claimed polynucleotide is related to the SEQ ID NO:1 polypeptide is, accordingly, very high.

The Examiner must accept the Applicants' demonstration that the homology between the
polypeptides encoded by the claimed invention and the SEQ ID NO:1 polypeptide demonstrates utility
by a reasonable probability unless the Examiner can demonstrate through evidence or sound scientific
reasoning that a person of ordinary skill in the art would doubt utility. See In re Langer, 503 F.2d
1380, 1391-92, 183 USPQ 288 (CCPA 1974). The Examiner has not provided sufficient evidence or
sound scientific reasoning to the contrary.

While the Examiner has cited literature (Russell, J. Mol. Bio. 244:332-350, Skolnick et al.,
Trends in Biotech. 18(1):34-39, and Attwood [sic, Atwood], Science 290:471-473, 29 October
2000) identifying some of the difficulties that may be involved in predicting protein function, none
suggests that functional homology cannot be inferred by a reasonable probability in this case. Most
important, none contradicts Brenner's basic rule that sequence homology in excess of 40% over 70 or
more amino acid residues yields a high probability of functional homology as well. At most, these
articles individually and together stand for the proposition that it is difficult to make predictions about
function with certainty. The standard applicable in this case is not, however, proof to certainty, but
rather proof to reasonable probability.

The Examiner contends that the use of sequence identity of the SEQ ID NO:2 polynucleotide to
the claimed polynucleotide variants is insufficient to identify the function of the claimed protein, relying

neither evidence nor sound scientific reasoning to support the allegations that the claimed
polynucleotides lacked adequate enablement. The Examiner made only the bare statement that "[s]ee
the following publications that support this unpredictability as well as noting certain conserved
sequences in limited specific cases." (Office Action, page 10.) The Examiner pointed to no specific
part of any of the cited documents that supported the allegations of lack of enablement.

Furthermore, it is well known in the art that sequence similarity (measured by statistical scores
as in Brenner) is predictive of similarity in functional activity. H. Hegyi and M. Gerstein ("The
Relationship between Protein Structure and Function: a Comprehensive Survey with Application to the
Yeast Genome," J. Mol. Biol. (1999) 288:147-164; Reference No. 8) state that "the proportion of
homologues with different functions is around 10%. This shows that there is a low chance that a
single-domain protein, highly homologous to a known enzyme, has a different function." (Hegyi
and Gerstein, Reference No. 8, page 159, column 1, emphasis added.) Furthermore, Hegyi and
Gerstein in a second journal article (H. Hegyi and M. Gerstein, "Annotation Transfer for Genomics:
Measuring Functional Divergence in Multi-Domain Proteins," Genome Research (2001) 11:
1632-1640; Reference No. 9) conclude that "the probability that two single-domain proteins that have
the same superfamily structure have the same function (whether enzymatic or not) is about 2/3." (Hegyi
and Gerstein, Reference No. 9, page 1635.) Hegyi and Gerstein also concluded that, for multi-domain
proteins with "almost complete coverage with exactly the same type and number of superfamilies,
following each other in the same order," "[t]he probability that the functions are the same in this case was
91%." (Hegyi and Gerstein, Reference No. 9, page 1636.) Hegyi and Gerstein (Reference No. 9,
page 1632) further note that

Wilson et al. (2000) compared a large number of protein domains to one another in a
pair-wise fashion with respect to similarities in sequence, structure, and function. Using
a hybrid functional classification scheme merging the ENZYME and FlyBase systems
(Gelbart et al. 1997; Bairoch 2000), they found that precise function is not conserved
below 30-40% identity, although the broad functional class is usually preserved for
sequence identities as low as 20-25%, given that the sequences have the same fold.
Their survey also reinforced the previously established general exponential relationship

As noted supra, the polypeptides encoded by the claimed polynucleotides share more than
90% sequence identity with the SEQ ID NO:1 polypeptide, well above the thresholds described in the
Hegyi and Gerstein Genome Research article (Reference No. 9) cited above. Therefore, there is a
reasonable probability that the utility of the SEQ ID NO:1 polypeptide can be imputed to the claimed
polynucleotides.

Applicants further submit that both the Revised Interim Utility Guidelines and the Revised
Interim Utility Guidelines Training Materials support the use of sequence homology to known proteins
to establish functional homology. The Revised Interim Utility Guidelines specifically state at page 1096
that the Examiner's decision to rebut Applicants' assertion of utility:

---must be supported by a preponderance of all evidence of record. More specifically,
when a patent application claiming a nucleic acid asserts a specific, substantial, and
credible utility, and bases the assertion upon homology to existing nucleic acids or
proteins having an accepted utility, the asserted utility must be accepted by the
Examiner unless the Office has sufficient evidence or sound scientific reasoning to rebut
such an assertion. "[A] 'rigorous correlation' need not be shown in order to establish
practical utility; 'reasonable correlation' is sufficient." (emphasis added).

Clearly the PTO recognizes the well known use of sequence homology in the art to establish
protein function. The Revised Interim Utility Guidelines Training Materials elaborate further on this
matter in Example 10: DNA Fragment encoding a Full Open Reading Frame (ORF) at page 53, which
recites a claim to a nucleic acid encoding a protein with 95% sequence identity to a known protein (a
DNA ligase). The example clearly states that "there is no reason to doubt the assertion that [the
claimed sequence] encodes a DNA ligase." Therefore the Revised Interim Utility Guidelines Training
Materials indicate that a sequence similarity of less than 100% is deemed reasonably to support, to one
skilled in the art, that two molecules could possess the same activity.

Moreover, in a recent Federal Circuit decision (Boehringer Ingelheim Vetmedica, Inc. v.
Schering-Plough Corporation and Schering Corporation; CAFC 02-1026, -1027, February 21,
2003), the Court stated that "the uncontroversial fact that even a single nucleotide or amino acid
substitution may drastically alter the function of a gene or protein is not evidence of anything at all. The
mere possibility that a single mutation could affect biological function cannot as a matter of law preclude
an assertion of equivalence."

d. Objective evidence corroborates the utilities of the claimed 
invention 

There is, in fact, no restriction on the kinds of evidence a Patent Examiner may consider in
determining whether a "real-world" utility exists. Indeed, "real-world" evidence, such as evidence
showing actual use or commercial success of the invention, can demonstrate conclusive proof of utility.
Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 F.2d 854, 856, 12
USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or sold by any person or
entity other than the patentee is conclusive proof of utility. United States Steel Corp. v. Phillips
Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989).

Over the past several years, a vibrant market has developed for databases containing the
sequences of all expressed genes (along with the polypeptide translations of those genes), in particular
genes having medical and pharmaceutical significance such as the instant sequences. (Note that the
value in these databases is enhanced by their completeness, but each sequence in them is independently
valuable.) The databases sold by Applicants' assignee, Incyte, include exactly the kinds of information
made possible by the claimed invention, such as tissue and disease associations. Incyte sells its
database containing the sequences of the claimed polynucleotides and millions of other sequences
throughout the scientific community, including to pharmaceutical companies who use the information to
develop new pharmaceuticals.

Both Incyte's customers and the scientific community have acknowledged that Incyte's
databases have proven to be valuable in, for example, the identification and development of drug
candidates. As Incyte adds information to its databases, including the information that can be generated
only as a result of Incyte's discovery of the claimed polynucleotides and its use of those polynucleotides
on cDNA microarrays, the databases become even more powerful tools. Thus the claimed invention
adds more than incremental benefit to the drug discovery and development process.

3. The Patent Examiner's Rejections Are Without Merit

Rather than responding to the evidence demonstrating utility, the Examiner attempts to dismiss it
altogether by arguing that the disclosed and well-established uses for the claimed polynucleotides are
not enabled. (Office Action at pages 9-10.) The Examiner is incorrect both as a matter of law and as
a matter of fact.

a. The Precise Biological Role Or Function Of An Expressed 
Polynucleotide Is Not Required To Demonstrate Utility 

The Patent Examiner's primary rejection of the claimed invention is based on the ground that,
without information as to the precise "biological function" of the claimed invention, the claimed
invention's utility is not sufficiently specific. According to the Examiner, it is not enough that a person of
ordinary skill in the art could use and, in fact, would want to use the claimed invention either by itself or
in a cDNA microarray to monitor the expression of genes for such applications as the evaluation of a
drug's efficacy and toxicity. The Examiner would require, in addition, that the applicant provide a
specific and substantial interpretation of the results generated in any given expression analysis.

It may be that specific and substantial interpretations and detailed information on biological
function are necessary to satisfy the requirements for publication in some technical journals, but they are
not necessary to satisfy the requirements for obtaining a United States patent. The relevant question is
not, as the Examiner would have it, whether it is known how or why the invention works, In re
Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999), but rather whether the invention provides an
"identifiable benefit" in presently available form. Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d
1364, 1366 (Fed. Cir. 1999). If the benefit exists, and there is a substantial likelihood the invention
provides the benefit, it is useful. There can be no doubt, particularly in view of the Bedilion Declaration
(at ¶¶ 10 and 15), that the present invention meets this test.

The threshold for determining whether an invention produces an identifiable benefit is low.
Juicy Whip, 185 F.3d at 1366. Only those utilities that are so nebulous that a person of ordinary skill

guidelines, so-called "throwaway" utilities that are not directed to a person of ordinary skill in the art at
all, do not meet the statutory requirement of utility. Utility Examination Guidelines, 66 Fed. Reg. 1092
(Jan. 5, 2001).

Knowledge of the biological function or role of a biological molecule has never been required to
show real-world benefit. In its most recent explanation of its own utility guidelines, the PTO
acknowledged so much (66 F.R. at 1095):

[T]he utility of a claimed DNA does not necessarily depend on the function of the
encoded gene product. A claimed DNA may have specific and substantial utility
because, e.g., it hybridizes near a disease-associated gene or it has gene-regulating
activity.

By implicitly requiring knowledge of biological function for any claimed nucleic acid, the
Examiner has, contrary to law, elevated what is at most an evidentiary factor into an absolute
requirement of utility. Rather than looking to the biological role or function of the claimed invention, the
Examiner should have looked first to the benefits it is alleged to provide.

b. Membership in a Class of Useful Products Can Be Proof of 
Utility 

Despite the uncontradicted evidence that the claimed polynucleotides encode polypeptides in
the glutathione s-transferase family and the family of expressed polypeptides, the Examiner refused to
impute the utility of the members of the glutathione s-transferase family and the family of expressed
polypeptides to the polypeptides encoded by the claimed polynucleotides.

In order to demonstrate utility by membership in a class, the law requires only that the class not 
contain a substantial number of useless members. So long as the class does not contain a substantial 
number of useless members, there is sufficient likelihood that the claimed invention will have utility, and 
a rejection under 35 U.S.C. § 101 is improper. That is true regardless of how the claimed invention 
ultimately is used and whether or not the members of the class possess one utility or many. See
Brenner v. Manson, 383 U.S. 519, 532 (1966); Application of Kirk, 376 F.2d 936, 945 (CCPA [...]

Membership in a "general" class is insufficient to demonstrate utility only if the class contains a
sufficient number of useless members such that a person of ordinary skill in the art could not impute
utility by a substantial likelihood. There would be, in that case, a substantial likelihood that the claimed
invention is one of the useless members of the class. In the few cases in which class membership did
not prove utility by substantial likelihood, the classes did in fact include predominantly useless members.
Brenner (man-made steroids); Kirk (same); Natta (man-made polyethylene polymers).

The Examiner addresses the polypeptides encoded by the claimed polynucleotides as if the 
general classes in which they are included are not the glutathione s-transferase family and the family of
expressed polypeptides, but rather all polynucleotides or all polypeptides, including the vast majority of
useless theoretical molecules not occurring in nature, and thus not pre-selected by nature to be useful.
While these "general classes" may contain a substantial number of useless members, the glutathione
s-transferase family and the family of expressed polypeptides do not. The glutathione s-transferase family
and the family of expressed polypeptides are sufficiently specific to rule out any reasonable possibility 
that the polypeptides encoded by the claimed polynucleotides would not also be useful like the other 
members of the family. 

Because the Examiner has not presented any evidence that the glutathione s-transferase family 
and the family of expressed polypeptides have any, let alone a substantial number, of useless members, 
the Examiner must conclude that there is a "substantial likelihood" that the polypeptides encoded by the 
claimed polynucleotides are useful. It follows that the claimed polynucleotides also are useful. 

c. Because the uses of the claimed polynucleotides in toxicology
testing, drug discovery, and disease diagnosis are practical uses
beyond mere study of the invention itself, the claimed invention
has substantial utility.

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed invention has a
beneficial use in research other than studying the claimed invention or its protein products. It is a tool,
rather than an object, of research. The data generated in gene expression monitoring using the claimed
[...] study properties of tissues, cells, and potential drug candidates and toxins. Without the claimed
invention, the information regarding the properties of tissues, cells, drug candidates and toxins is less
complete. [Bedilion Declaration at ¶ 15.]

The claimed invention has numerous additional uses as a research tool, each of which alone is a
"substantial utility." These include diagnostic assays (e.g., page 34, line 18 through page 37, line 11)
and chromosomal mapping (e.g., page 37, line 12 through page 38, line 10).

d. The Patent Examiner Failed to Demonstrate That a Person of
Ordinary Skill in the Art Would Reasonably Doubt the Utility of
the Claimed Invention

Based principally on citations to scientific literature identifying some of the difficulties involved in
predicting protein function, the Examiner rejected the pending claims on the ground that the Applicants
cannot impute utility to the claimed polynucleotides based on their 90% sequence similarity to the SEQ
ID NO:2 polynucleotide. The Examiner's rejection is incorrect both as a matter of fact and as a matter
of procedural law.

As demonstrated in § VI.B.2.C, supra, the literature cited by the Examiner is not inconsistent 
with the Applicants' proof of homology by a reasonable probability. It may show that Applicants 
cannot prove function by homology with certainty, but Applicants need not meet such a rigorous 
standard of proof. Under the applicable law, once the applicant demonstrates a prima facie case of
homology, the Examiner must accept the assertion of utility to be true unless the Examiner comes
forward with evidence showing a person of ordinary skill would doubt the asserted utility could be
achieved by a reasonable probability. See In re Brana, 51 F.3d at 1566; In re Langer, 503 F.2d
1380, 1391-92, 183 USPQ 288 (CCPA 1974). The Examiner has not made such a showing and, as
such, the Examiner's rejection should be withdrawn.

4. By Requiring the Patent Applicant to Assert a Particular or Unique
Utility, the Patent Examination Utility Guidelines and Training
Materials Applied by the Patent Examiner Misstate the Law

There is an additional independent reason to withdraw the rejections: to the extent the
rejections are based on Revised Interim Utility Examination Guidelines (64 FR 71427, December 21,
1999), the final Utility Examination Guidelines (66 FR 1092, January 5, 2001) and/or the Revised
Interim Utility Guidelines Training Materials (USPTO Website www.uspto.gov, March 1, 2000), the
Guidelines and Training Materials are themselves inconsistent with the law.

The Training Materials, which direct the Examiners regarding how to apply the Utility
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: "specific"
utilities which meet the statutory requirements, and "general" utilities which do not. The Training
Materials define a "specific utility" as follows:

A [specific utility] is specific to the subject matter claimed. This contrasts to general
utility that would be applicable to the broad class of invention. For example, a claim to
a polynucleotide whose use is disclosed simply as "gene probe" or "chromosome 
marker" would not be considered to be specific in the absence of a disclosure of a 
specific DNA target. Similarly, a general statement of diagnostic utility, such as 
diagnosing an unspecified disease, would ordinarily be insufficient absent a disclosure of 
what condition can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," i.e., unique (Training Materials at p. 52) as
compared to the "broad class of invention." (In this regard, the Training Materials appear to parallel
the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility Guidelines, 82
J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the question to ask is
whether or not a utility set forth in the specification is particular to the claimed invention.")). 

Such "unique" or "particular" utilities never have been required by the law. To meet the utility
requirement, the invention need only be "practically useful," Natta, 480 F.2d at 1397, and confer a
"specific benefit" on the public. Brenner, 383 U.S. at 534. Thus, incredible "throwaway" utilities, such
as trying to "patent a transgenic mouse by saying it makes great snake food," do not meet this standard.
Karen Hall, Genomic Warfare, The American Lawyer 68 (June 2000) (quoting John Doll, Chief of the
Biotech Section of USPTO).
[...] that are unique to an invention. The law requires that the practical utility be "definite," not particular.
Montedison, 664 F.2d at 375. Applicants are not aware of any court that has rejected an assertion of
utility on the grounds that it is not "particular" or "unique" to the specific invention. Where courts have
found utility to be too "general," it has been in those cases in which the asserted utility in the patent
disclosure was not a practical use that conferred a specific benefit. That is, a person of ordinary skill in
the art would have been left to guess as to how to benefit at all from the invention. In Kirk, for
example, the CCPA held the assertion that a man-made steroid had "useful biological activity" was
insufficient where there was no information in the specification as to how that biological activity could be
practically used. Kirk, 376 F.2d at 941.

The fact that an invention can have a particular use does not provide a basis for requiring a
particular use. See Brana, supra (disclosure describing a claimed antitumor compound as being
homologous to an antitumor compound having activity against a "particular" type of cancer was
determined to satisfy the specificity requirement). "Particularity" is not and never has been the sine qua
non of utility; it is, at most, one of many factors to be considered.

As described supra, broad classes of inventions can satisfy the utility requirement so long as a
person of ordinary skill in the art would understand how to achieve a practical benefit from knowledge
of the class. Only classes that encompass a significant portion of nonuseful members would fail to meet
the utility requirement. Supra § VI.B.3.b (Montedison, 664 F.2d at 374-75).

The Training Materials fail to distinguish between broad classes that convey information of 
practical utility and those that do not, lumping all of them into the latter, unpatentable category of 
"general" utilities. As a result, the Training Materials paint with too broad a brush. Rigorously applied, 
they would render unpatentable whole categories of inventions that heretofore have been considered to
be patentable and that have indisputably benefitted the public, including the claimed invention. See
supra § VI.B.3.b. Thus the Training Materials cannot be applied consistently with the law.

VII. Rejection of Claims 2-4 Under 35 U.S.C. § 112, second paragraph

[...] Claim 2 recites the limitation "an isolated polynucleotide encoding a polypeptide"; and
claim 3 recites the limitation "a recombinant polynucleotide" however claim 1 does not
recite a polynucleotide. There is insufficient antecedent basis for this limitation in the 
claim. (Office Action, page 11.) 

In order to expedite prosecution, Claim 2 has been amended to an independent claim. Claims
3 and 4 depend from independent Claim 2. For at least the above reasons, Applicants respectfully
request that the Examiner withdraw the indefiniteness rejection.

VIII. Rejection of Claim 8 Under the Judicially Created Doctrine of Obviousness-type
Double Patenting

The Examiner rejected Claim 8 under the judicially created doctrine of obviousness-type
double patenting as being unpatentable over Claims 3-4 of U.S. Patent No. 5,817,497 (Office Action,
pages 11-12). Applicants request that the requirement for submission of a Terminal Disclaimer be held 
in abeyance until such time as there is an indication of allowable subject matter. 

IX. Rejection of Claims 8-9 Under 35 U.S.C. § 102(b) as Being Anticipated by Hillier et al.

The Examiner rejected Claims 8-9 under 35 U.S.C. § 102(b) as being anticipated by Hillier et
al. (Accession Number H27975). The Examiner alleged that "Hillier et al. teach a human cDNA clone
that is similar to mouse glutathione transferase GST," that "[t]here are stretches of nucleic acids within
the sequence that are complementary to SEQ ID NO:2," and that "the sequence of Hillier et al., recites
a polynucleotide comprising at least 60 contiguous nucleic acids." (Office Action, page 12.)

Claim 9 is canceled, and Claim 8 is amended as follows:

An isolated polynucleotide comprising a sequence selected from the group consisting of:
a) a polynucleotide sequence of SEQ ID NO:2,

b) a naturally-occurring polynucleotide sequence having at least 90% sequence
identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID
NO:2,

c) a polynucleotide sequence completely complementary to a),

d) a polynucleotide sequence completely complementary to b), and

e) a ribonucleotide equivalent of a)-d).




Docket No.: PF-0162-3 DIV 

The Hillier et al. document does not teach a polynucleotide of Claim 8. For at least the above
reasons, Applicants respectfully request that the Examiner withdraw the novelty rejection.

X. Rejection of Claims 2-4 Under 35 U.S.C. § 103(a) as Being Unpatentable Over Hillier
et al. in View of Simula et al. 

The Examiner rejected Claims 2-4 under 35 U.S.C. § 103(a) as being unpatentable over Hillier
et al. (Accession Number H27975) in view of Simula et al. The Examiner stated that "Hillier et al., has
been discussed above, however Hillier et al., does not disclose recombinant transformed cells
comprising promoters," that "Simula et al., developed Salmonella [typhimurium]
strains that express human glutathione S-transferase (GST) (abstract). The GST was expressed using
regulatable tac promoter expression systems (abstract)," and that "it would have been prima facie
obvious to modify the cell transformed with a recombinant polynucleotide as taught by Simula et al.,
with the polynucleotide of Hillier et al." (Office Action, page 13.)
Claim 2 is amended as follows: 

2. An isolated polynucleotide encoding a polypeptide comprising an amino
acid sequence selected from the group consisting of:

a) an amino acid sequence of SEQ ID NO:1, and

b) a naturally-occurring amino acid sequence having at least 90%
sequence identity to the sequence of SEQ ID NO:1 over the entire
length of SEQ ID NO:1.

Neither the Hillier et al. document nor the Simula et al. document teaches or suggests a
polynucleotide encoding a polypeptide comprising an amino acid sequence selected from the group
consisting of a) an amino acid sequence of SEQ ID NO:1, and b) a naturally-occurring amino acid
sequence having at least 90% sequence identity to the sequence of SEQ ID NO:1 over the entire length
of SEQ ID NO:1. Therefore, neither the Hillier et al. document nor the Simula et al. document, either
alone or in combination, renders obvious Claims 2-4. For at least the above reasons, Applicants
respectfully request that the Examiner withdraw the obviousness rejection.

In light of the above amendments and remarks, Applicants submit that the present application is
fully in condition for allowance, and request that the Examiner withdraw the outstanding objections and
rejections. Early notice to that effect is earnestly solicited.

If the Examiner contemplates other action, or if a telephone conference would expedite
allowance of the claims, Applicants invite the Examiner to contact Applicants' Agent at
(650) 845-4646.

Applicants believe that no fee is due with this communication. However, if the USPTO
determines that a fee is due, the Commissioner is hereby authorized to charge Deposit Account No.
09-0108.



Respectfully submitted, 
INCYTE GENOMICS, INC. 

Date: [illegible]

Susan K. Sather 
Reg. No. 44,316 

Direct Dial Telephone: (650) 845-4646 

3160 Porter Drive 
Palo Alto, California 94304
Phone: (650) 855-0555
Fax: (650) 849-8886

VERSION WITH MARKINGS TO SHOW CHANGES MADE



IN THE SPECIFICATION: 

Paragraph beginning at page 1, line 1, has been amended as follows:



This application is a divisional application of U.S. application serial number 09/309,320, filed
May 11, 1999, issued June 19, 2001, as U.S. Patent No. 6,248,325, which is a divisional of U.S.
application serial number 09/096,571, filed June 12, 1998, issued November 2, 1999, as U.S. Patent
No. 5,976,528, which is a divisional application of U.S. application serial number 08/756,771, filed
November 26, 1996, issued October 6, 1998, as U.S. Patent No. 5,817,497 [...]
numbers 09/309,320, 09/096,571, and 08/756,771 are hereby expressly incorporated by reference.



IN THE CLAIMS: 

Claim 9 has been canceled. 

Claims 2 and 8 have been amended as follows: 



2. (Once Amended) An isolated polynucleotide encoding a polypeptide [of claim 1]
comprising an amino acid sequence selected from the group consisting of:
a) an amino acid sequence of SEQ ID NO:1, and
b) a naturally-occurring amino acid sequence having at least 90% sequence identity to
the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1.

8. (Once Amended) An isolated polynucleotide comprising a sequence selected from the
group consisting of:
a) a polynucleotide sequence of SEQ ID NO:2,
b) a naturally-occurring polynucleotide sequence having at least 90% sequence identity
[...]
d) a polynucleotide sequence completely complementary to b), and
e) a ribonucleotide equivalent of a)-d).



Proc Natl Acad Sci USA 

Vol. 95, pp. 6073-6078, May 1998

Biochemistry 



Assessing sequence comparison methods with reliable structurally 
identified distant evolutionary relationships 

Steven E. Brenner*†‡, Cyrus Chothia*, and Tim J. P. Hubbard§

*MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and §Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambs CB10 1SA, United Kingdom

Communicated by David R. Davies, National Institute of Diabetes, Bethesda, MD, March 16, 1998 (received for review November 12, 1997)



ABSTRACT Pairwise sequence comparison methods have 
been assessed using proteins whose relationships are known 
reliably from their structures and functions, as described in 
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The
evaluation tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary
relationships are known unambiguously and independently of the
methods being evaluated. However, nearly all known
homologs have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these
complementary goals are linked such that increasing one 



Sequence comparison methodologies have evolved rapidly, 
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence
comparison work? That is, what fraction of homologous proteins
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of
homologs, which are known independently of sequence
comparison. Second, we use an assessment method that jointly
measures both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative
proteins from each of 67 protein superfamilies defined by the
PIR database (9). Each was used as a query to search the
database, and the matched proteins were marked as being
[...] superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of blast and fasta. Their test with blast 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependence of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies. 
The problem is widespread: each superfamily in PIR 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other PIR superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 
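
As a rough illustration of the length-dependent threshold, the sketch below evaluates the HSSP curve in the form quoted in the Fig. 1 legend of this paper (H = 290.15 l^{-0.562} between 10 and 80 residues, 24.7 beyond 80). The function names are hypothetical, and returning infinity for very short alignments is an assumption standing in for the legend's "H > 100".

```python
import math


def hssp_threshold(length: int) -> float:
    """Percentage-identity threshold H for an alignment of `length` residues."""
    if length < 10:
        # Assumption: the legend only says H > 100 here, i.e. no identity suffices.
        return math.inf
    if length > 80:
        return 24.7
    return 290.15 * length ** -0.562


def hssp_adjusted_score(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus the threshold H."""
    return percent_identity - hssp_threshold(length)


for n in (20, 40, 80, 200):
    print(n, round(hssp_threshold(n), 1))
```

The curve falls to roughly 25% at 80 residues, which matches the statement that 25% identity over 80 residues implies similar structure, while shorter alignments need higher identity.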

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and ssearch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical
tractability of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on
biological data to determine the degree to which such rankings are
superior. 
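
For orientation, the relation between a raw score and the statistical scores can be sketched with the standard Karlin-Altschul form E = K m n exp(-λS), with P = 1 - exp(-E). This is a minimal sketch, not the code of any program assessed here; the parameter values in the usage lines are illustrative assumptions, since λ and K depend on the scoring matrix and gap model.

```python
import math


def karlin_altschul_evalue(score: float, query_len: int, db_len: int,
                           lam: float, k: float) -> float:
    """Expected number of chance matches scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def evalue_to_pvalue(evalue: float) -> float:
    """Probability of at least one chance match, P = 1 - exp(-E)."""
    return -math.expm1(-evalue)  # expm1 keeps precision when E is tiny


# Illustrative (assumed) parameters only.
e = karlin_altschul_evalue(score=50, query_len=100, db_len=1000, lam=0.3, k=0.1)
print(e, evalue_to_pvalue(e))
```

For small E the two measures nearly coincide, so they diverge only for weak matches, where P saturates at 1.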

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the SCOP database (4, 5) has allowed us to overcome previous
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of 
distant relationships, or ≈0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988
relationships, representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/ 
sss/, and databases derived from the current version of SCOP 
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
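
The selection procedure described above can be sketched as a greedy filter. This is only an illustration under stated assumptions: the `quality` and `identity` callables, the toy sequences, and the function names are hypothetical stand-ins, since the text does not specify how domain quality or pairwise identity were computed. The helper `ordered_pairs` reproduces the pair counts quoted in the text.

```python
from typing import Callable, List


def select_nonredundant(
    domains: List[str],
    quality: Callable[[str], float],        # hypothetical quality measure
    identity: Callable[[str, str], float],  # hypothetical % identity measure
    threshold: float,
) -> List[str]:
    """Greedy selection: keep the best remaining domain, discard its near-duplicates."""
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)   # highest-quality domain still on the list
        selected.append(best)
        # Drop every other domain at or above the identity threshold to `best`.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)


def toy_identity(a: str, b: str) -> float:
    """Toy percentage identity for equal-length strings (illustration only)."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


toy_quality = {"AAAT": 3.0, "AAAA": 2.0, "TTTT": 1.0}
kept = select_nonredundant(list(toy_quality), toy_quality.get, toy_identity, 40.0)
print(kept)                  # the 75%-identical "AAAA" is discarded
print(ordered_pairs(1323))   # 1749006, the total quoted for PDB40D-B
```

With 1,323 domains, 9,044 related ordered pairs out of 1,749,006 is about 0.52%, consistent with the ≈0.5% stated in the text.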

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the PDB of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted
otherwise, the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have

Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
[...] same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors corresponds to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is H = 290.15 l^{-0.562}, where
l is length for 10 < l < 80; H > 100 for l < 10; H = 24.7 for l > 80. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 
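The three percentage identity measures described in the Fig. 1 caption can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the boundary lengths 10 and 80 (which the caption leaves open) are assigned to the power-law branch, and "H > 100" is encoded as infinity so that no alignment shorter than 10 residues can pass.

```python
def hssp_threshold(length: int) -> float:
    """HSSP cutoff H (in percent identity) for an alignment of `length` residues.

    Piecewise form as quoted in the Fig. 1 caption:
    H = 290.15 * l**-0.562 for 10 < l < 80, H > 100 for l < 10, H = 24.7 for l > 80.
    """
    if length < 10:
        return float("inf")  # assumption: "H > 100" means nothing can pass
    if length > 80:
        return 24.7
    # assumption: the endpoints l = 10 and l = 80 use the power-law branch
    return 290.15 * length ** -0.562


def identity_within_alignment(identical: int, aligned_length: int) -> float:
    """Percent identity over the aligned region only."""
    return 100.0 * identical / aligned_length


def identity_within_both(identical: int, query_length: int, target_length: int) -> float:
    """Identical residues as a percentage of the average sequence length."""
    return 100.0 * identical / ((query_length + target_length) / 2.0)


def hssp_adjusted_identity(identical: int, aligned_length: int) -> float:
    """Percent identity within the alignment minus the HSSP cutoff H."""
    return identity_within_alignment(identical, aligned_length) - hssp_threshold(aligned_length)


if __name__ == "__main__":
    # Hypothetical counts: 27 identical residues over a 62-residue alignment (about 43.5%).
    print(round(identity_within_alignment(27, 62), 1))
    print(round(hssp_threshold(62), 1))
    print(round(hssp_adjusted_identity(27, 62), 1))
```

A positive adjusted score means the alignment lies above the HSSP curve for its length; a negative score means it falls below.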

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Receiver
Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonhomologs.
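The coverage and EPQ definitions above translate directly into a short routine. The following is a minimal sketch, not the authors' code: the function names and the toy hit list are hypothetical, scores are assumed to be oriented so that higher is better (E-values would need to be negated first), and tied scores are not grouped.

```python
from typing import List, Tuple


def coverage_vs_error(
    hits: List[Tuple[float, bool]], total_homologs: int, n_queries: int
) -> List[Tuple[float, float, float]]:
    """Return (threshold score, coverage, errors per query) for every hit.

    `hits` holds one (score, is_homolog) entry per query-target pair.
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)  # best to worst
    points = []
    found = 0   # homologous pairs at or above the current threshold
    errors = 0  # nonhomologous pairs at or above the current threshold
    for score, is_homolog in ranked:
        if is_homolog:
            found += 1
        else:
            errors += 1
        points.append((score, found / total_homologs, errors / n_queries))
    return points


def coverage_at_epq(points: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Largest coverage reached without exceeding `max_epq` errors per query."""
    best = 0.0
    for _, coverage, epq in points:
        if epq <= max_epq:
            best = max(best, coverage)
    return best


if __name__ == "__main__":
    # Toy data: five scored pairs, four true homologs in the database, two queries.
    toy_hits = [(50, True), (40, True), (30, False), (20, True), (10, False)]
    curve = coverage_vs_error(toy_hits, total_homologs=4, n_queries=2)
    print(coverage_at_epq(curve, max_epq=0.0))
    print(coverage_at_epq(curve, max_epq=0.5))
```

With the figures quoted for PDB40D-B (1,323 queries and 9,044 homologous pairs), 13 errors give 13/1323, or about 0.01 EPQ, and 10% coverage corresponds to roughly 904 detected pairs.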

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by database
searching programs, if the programs' estimates are accurate.

Fig. 4. Reliability of statistical scores in PDB90D-B. Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for ssearch and
fasta, whereas P-values are shown for blast and wu-blast2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
ssearch and fasta are shown to have good agreement with EPQ but
underestimate the significance slightly. blast and wu-blast2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.
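The Fig. 4 caption notes that P-values should track EPQ for small values and diverge for larger ones. A brief sketch of why, assuming the standard Poisson relation P = 1 - exp(-E) between a P-value and an expectation value (the relation is not spelled out in the text above, and the function names are hypothetical):

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given an expected count `e_value`."""
    return -math.expm1(-e_value)  # 1 - exp(-E), computed stably for small E


def e_from_p(p_value: float) -> float:
    """Inverse relation: expected number of chance hits for a given P-value."""
    return -math.log1p(-p_value)  # -ln(1 - P)


if __name__ == "__main__":
    for e in (0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:<6} P = {p_from_e(e):.5f}")
```

For small E the two quantities are nearly equal, whereas P saturates toward 1 as E grows, which matches the divergence of the two bold reference lines described in the caption.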

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the alignment
and subtracting gap penalties. In blast, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
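As a rough illustration of how a raw score is assembled from an existing alignment, the sketch below sums substitution scores and subtracts gap penalties. It makes several assumptions: the tiny matrix holds made-up values (it is not BLOSUM45), the -12/-1 penalties are read as -12 for the first gapped position and -1 for each further position in the same gap, and adjacent gaps in opposite sequences are treated as a single gap for simplicity.

```python
# Hypothetical toy substitution scores, for illustration only (not BLOSUM45).
TOY_MATRIX = {
    ("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
    ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2,
}


def pair_score(matrix: dict, a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def raw_score(query_aln: str, target_aln: str, matrix: dict,
              gap_open: int = -12, gap_extend: int = -1) -> int:
    """Raw score of a gapped alignment given as two equal-length strings ('-' = gap)."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(query_aln, target_aln):
        if a == "-" or b == "-":
            # assumed convention: first gapped position costs gap_open,
            # each additional position in the same gap costs gap_extend
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += pair_score(matrix, a, b)
            in_gap = False
    return score


if __name__ == "__main__":
    print(raw_score("AWG--A", "AWGAGA", TOY_MATRIX))
```

Note that this only scores a given alignment; the Smith-Waterman algorithm itself searches for the local alignment that maximizes this quantity.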

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the HSSP equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

Fig. 5. Coverage vs. error plots of different sequence comparison methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow ssearch, which finds 18% of relationships
at 1% EPQ. fasta ktup = 1 and wu-blast2 are almost as good. (B) PDB90D-B database. The quick wu-blast2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than fasta ktup = 1 and ssearch.

We find that statistical scores are not only powerful, but also
easy to interpret. ssearch and fasta show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two sequences
being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs 
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 

The P-values from blast also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. wu-blast2 scores were more reliable
than those from blast but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas fasta
ktup = 1 is nearly as effective as SSEARCH. fasta ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than fasta ktup = 1. WU-BLAST2 is slightly faster than
fasta ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is wu-blast2. Consequently, we infer that the
differences between fasta ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. ssearch with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by ssearch E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(ssearch with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged extremely
far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that ssearch can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

After completion of this work, a new version of pairwise 
BLAST was released: blastgp (37). It supports gapped align- 
ments, like wu-BLAST2, and dispenses with sum statistics. Our 
initial tests on blastgp using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped blast, but not 
quite equal to that of wu-blast2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective sequence
searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the E-values
reported by fasta and ssearch give fairly accurate
estimates of the significance of each match, but the P-values
provided by blast and wu-blast2 underestimate the true
extent of errors. Second, ssearch, wu-blast2, and fasta
ktup = 1 perform best, though blast and fasta ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence comparison
can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.

1. An important feature of the work of many molecular biologists is identifying which
genes are switched on and off in a cell under different environmental conditions or
subsequent to xenobiotic challenge. Such information has many uses, including the
deciphering of molecular pathways and facilitating the development of new experimental
and diagnostic procedures. However, the student of gene hunting should be forgiven for
perhaps becoming confused by the mountain of information available, as there appear to be
almost as many methods of discovering differentially expressed genes as there are research
groups using the technique.

2. The aim of this review was to clarify the main methods of differential gene expression
analysis and the mechanistic principles underlying them. Also included is a discussion on
some of the practical aspects of using this technique. Emphasis is placed on the so-called
'open' systems, which require no prior knowledge of the genes contained within the study
model. Whilst these will eventually be replaced by 'closed' systems in the study of human,
mouse and other commonly studied laboratory animals, they will remain a powerful tool for
those examining less fashionable models.

3. The use of suppression-PCR subtractive hybridization is exemplified in the
identification of up- and down-regulated genes in rat liver following exposure to phenobarbital,
a well-known inducer of the drug metabolizing enzymes.

4. Differential gene display provides a coherent platform for building libraries and
microchip arrays of 'gene fingerprints' characteristic of known enzyme inducers and
xenobiotic toxicants, which may be interrogated subsequently for the identification and
characterization of xenobiotics of unknown biological properties.

Introduction

It is now apparent that the development of almost all cancers and many non-neoplastic
diseases is accompanied by altered gene expression in the affected cells
compared to their normal state (Hunter 1991, Wynford-Thomas 1991, Vogelstein
and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and Van Heyningen 1998).
Such changes also occur in response to external stimuli such as pathogenic micro-organisms
(Rohn et al. 1996, Singh et al. 1997, Griffin and Krishna 1998, Lunney
1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, Ramana and Kohli
1998), as well as during the development of undifferentiated cells (Hecht 1998,
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential
medical and therapeutic benefits of understanding the molecular changes which
occur in any given cell in progressing from the normal to the 'altered' state are
enormous. Such profiling essentially provides a 'fingerprint' of each step of a
cell's development or response and should help in the elucidation of specific and
sensitive biomarkers representing, for example, different types of cancer or previous
exposure to certain classes of chemicals that are enzyme inducers.

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including 
the well-characterized isoforms of cytochrome P450) are inducible by drugs and 
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional 
activation of not only the cognate cytochrome P450 genes, but additional cellular 
proteins which may be crucial to the phenomenon of induction. Accordingly, the 
development of methodology to identify and assess the full complement of genes 
that are either up- or down-regulated by inducers is crucial in the development of
knowledge to understand the precise molecular mechanisms of enzyme induction 
and how this relates to drug action. Similarly, in the field of chemical-induced 
toxicity, it is now becoming increasingly obvious that most adverse reactions to 
drugs and chemicals are the result of multiple gene regulation, some of which are 
causal and some of which are casually-related to the toxicological phenomenon per 
se. This observation has led to an upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools in target tissues
and is, therefore, of value in rationalizing the molecular mechanisms of xenobiotic- 
induced toxicity. Knowledge of toxin-dependent gene regulation in target tissues is 
not solely an academic pursuit as much interest has been generated in the 
pharmaceutical industry to harness this technology in the early identification of toxic 
drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the gene profile 
in response to, say, a testicular toxin that has been well-characterized in vivo could be
determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such toxicants. 
Whereas it would be informative to know the identity and functionality of all genes 
up/down regulated by such toxicants, this would appear a longer term goal, as the 
majority of human genes have not yet been sequenced, far less their functionality 
determined. However, the current use of gene profiling yields a pattern of gene 
changes for a xenobiotic of unknown toxicity which may be matched to that of well- 
characterized toxins, thus alerting the toxicologist to possible in vivo similarities 
between the unknown and the standard, thereby providing a platform for more
extensive toxicological examination. Such approaches are beginning to gain 
momentum, in that several biotechnology companies are commercially producing 
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of 
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are 
degenerate in the sense that not all of the genes are mechanistically-related to any
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum
screening, they are maturing at a substantial rate, in that gene arrays are now
becoming more specific, e.g. chips for the identification of changes in growth factor
families that contribute to the aetiology and development of chemically-induced
neoplasias.
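The idea of matching the gene-change pattern of an unknown xenobiotic against those of well-characterized toxins can be illustrated with a very simple comparison. The sketch below is purely hypothetical: the gene and toxin names are invented, each profile is reduced to +1 (up-regulated) or -1 (down-regulated) per gene, and the similarity measure (fraction of shared genes changing in the same direction) is an assumption chosen for clarity, not a method described in this review.

```python
from typing import Dict


def profile_similarity(unknown: Dict[str, int], reference: Dict[str, int]) -> float:
    """Fraction of genes present in both profiles that change in the same direction."""
    shared = set(unknown) & set(reference)
    if not shared:
        return 0.0
    agree = sum(1 for gene in shared if unknown[gene] == reference[gene])
    return agree / len(shared)


def best_match(unknown: Dict[str, int], references: Dict[str, Dict[str, int]]) -> str:
    """Name of the reference profile most similar to the unknown profile."""
    return max(references, key=lambda name: profile_similarity(unknown, references[name]))


if __name__ == "__main__":
    # Invented example profiles: +1 = up-regulated, -1 = down-regulated.
    unknown_compound = {"gene_a": 1, "gene_b": -1, "gene_c": 1}
    reference_profiles = {
        "toxin_X": {"gene_a": 1, "gene_b": -1, "gene_c": -1},
        "toxin_Y": {"gene_a": -1, "gene_b": 1, "gene_d": 1},
    }
    print(best_match(unknown_compound, reference_profiles))
```

A close match would only flag a possible mechanistic similarity for further toxicological examination; it would not by itself establish toxicity.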

Although documenting and explaining these genetic changes presents a
formidable obstacle to understanding the different mechanisms of development and
disease progression, the technology is now available to begin attempting this difficult
challenge. Indeed, several 'differential expression analysis' methods have been
developed which facilitate the identification of gene products that demonstrate
altered expression in cells of one population compared to another. These methods
have been used to identify differential gene expression in many situations, including
invading pathogenic microbes (Zhao et al. 1998), in cells responding to extracellular
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997,
Maldarelli et al. 1998), in chemically treated cells (Syed et al. 1997, Rockett et al.
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998),
activated cells (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells (Hara et
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984,
Hedrick et al. 1984, Xhu et al. 1998). Although differential expression analysis
technologies are applicable to a broad range of models, perhaps their most important
advantage is that, in most cases, absolutely no prior knowledge of the specific genes
which are up- or down-regulated is required.

The field of differential expression analysis is a large and complex one, with 
many techniques available to the potential user. These can be categorized into 
several methodological approaches, including: 

(1) Differential screening,

(2) Subtractive hybridization (SH) (includes methods such as chemical cross-linking
subtraction — CCLS, suppression-PCR subtractive hybridization —
SSH, and representational difference analysis — RDA),

(3) Differential display (DD), 

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene 
expression — SAGE — and gene expression fingerprinting — GEF), 

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own 
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful
and increasingly popular technique. Specifically, we will concentrate on the so-called
'open' systems, namely those which do not require any knowledge of gene
sequences and, therefore, are useful for isolating unknown genes. Two 'closed'
systems (those utilising previously identified gene sequences), EST analysis and the
use of DNA arrays, will also be considered briefly for completeness. Whilst
emphasis will often be placed on suppression PCR subtractive hybridization (SSH,
the approach employed in this laboratory), it is the aim of the authors to highlight,
wherever possible, those areas of common interest to those who use, or intend to use,
differential gene expression analysis.



Differential cDNA library screening (DS) 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original [...]
hybridization', which was used to isolate galactose-inducible DNA sequences from
yeast. The theory is simple: a genomic DNA library is prepared from normal, 
unstimulated cells of the test organism/tissue and multiple filter replicas are 
prepared. These replica blots are probed with radioactively (or otherwise) labelled 
complex cDNA probes prepared from the control and test cell mRNA populations. 
Those mRNAs which are differentially expressed in the treated cell population will 
show a positive signal only on the filter probed with cDNA from the treated cells. 
Furthermore, labelled cDNA from different test conditions can be used to probe 
multiple blots, thereby enabling the identification of mRNAs which are only up-regulated
under certain conditions. For example, St John and Davis (1979) screened
replica filters with acetate-, glucose- and galactose-derived probes in order to obtain
genes induced specifically by galactose metabolism. Although groundbreaking in its
time, this method is now considered insensitive and time-consuming, as up to 2
months are required to complete the identification of genes which are differentially 
expressed in the test population. In addition, there is no convenient way to check 
that the procedure has worked until the whole process has been completed. 
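The logic of reading replica filters probed under several conditions can be sketched as simple set operations. This is only a conceptual illustration: the clone identifiers are invented, and each filter is reduced to the set of clones giving a positive signal, which ignores signal intensity and the sensitivity limits noted above.

```python
from typing import Dict, Set


def condition_specific(signals: Dict[str, Set[str]], condition: str) -> Set[str]:
    """Clones that give a positive signal only on the filter probed with `condition`."""
    # Pool the positives from every other filter, then subtract them.
    others = set().union(
        *(clones for name, clones in signals.items() if name != condition)
    )
    return signals[condition] - others


if __name__ == "__main__":
    # Invented clone identifiers on three replica filters.
    filter_signals = {
        "acetate": {"c1", "c2"},
        "glucose": {"c2", "c3"},
        "galactose": {"c2", "c4", "c5"},
    }
    print(sorted(condition_specific(filter_signals, "galactose")))
```

Here clone c2 is positive on every filter and is therefore discarded, leaving only the clones that appear with the galactose-derived probe alone.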

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early 
approaches such as that described by St John and Davis (1979) soon gave rise to a 
search for more convenient methods of analysis. One of the first to be developed was 
SH, numerous variations of which have since been reported (see below). In general, 
this approach involves hybridization of mRNA/cDNA from one population (tester) 
to excess mRNA/cDNA from another (driver), followed by separation of the 
unhybridized tester fraction (differentially expressed) from the hybridized common
sequences. This step has been achieved physically, chemically and through the use 
of selective polymerase chain reaction (PCR) techniques. 

Physical separation 

Original subtractive hybridization technology involved the physical separation
of hybridized common species from unique single stranded species. Several methods
of achieving this have been described, including hydroxyapatite chromatography
(Sargent and Dawid 1983), avidin-biotin technology (Duguid and Dinauer 1990)
and oligodT-latex separation (Hara et al. 1991). In the first approach, common 
mRNA species are removed by cDNA (from test cells)-mRNA (from control cells) 
subtractive hybridization followed by hydroxyapatite chromatography, as hydroxy- 
apatite specifically adsorbs the cDNA-mRNA hybrids. The unabsorbed cDNA is 
then used either for the construction of a cDNA library of differentially expressed 
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to 
screen a preselected library (Zimmerman et al. 1980, Davis et al. 1984, Hedrick et al.
1984). A schematic diagram of the procedure is shown in figure 1. 

Less rigorous physical separation procedures coupled with sensitivity enhancing 
PCR steps were later developed as a means to overcome some of the problems 
encountered with the hydroxyapatite procedure. For example, Duguid and Dinauer
(1990) described a method of subtraction utilizing biotin-affinity systems as a means
to remove hybridized common sequences. In this process, both the control and
tester mRNA populations are first converted to cDNA and an adaptor ('oligovector',
containing a restriction site) ligated to both sides. Both populations are then
amplified by PCR, but the driver cDNA population is subsequently digested with
the adaptor-containing restriction endonuclease. This serves to cleave the oligovector
and reduce the amplification potential of the control population. The digested
control population is then biotinylated and an excess mixed with tester cDNA.
Following denaturation and hybridization, the mix is applied to a biocytin column [...]

Figure 1. The hydroxyapatite method of subtractive hybridization. cDNA derived from the
treated/altered (tester) population is mixed with a large excess of mRNA from the control (driver)
population. Following hybridization, mRNA-cDNA hybrids are removed by hydroxyapatite
chromatography. The only cDNAs which remain are those which are differentially expressed in
the treated/altered population. In order to facilitate the recovery of full length clones, small cDNA
fragments are removed by exclusion chromatography. The remaining cDNAs are then cloned into
a vector for sequencing, or labelled and used directly to probe a library, as described by Sargent
and Dawid (1983).

[Figure 2 diagram. Steps shown: anneal mRNA to polydT-latex beads; cDNA synthesis; mix and anneal;
centrifuge beads, collect and store supernatant, dissociate polyA, reapply supernatant;
tester-specific mRNA retrieved after 4 rounds of hybridization; cDNA synthesis; ligate adaptors
and insert into vector; sequence inserts and/or carry out other downstream applications.]

Figure 2. The use of oligodT-latex to perform subtractive hybridization. mRNA extracted from the
control (driver) population is converted to anchored cDNA using polydT oligonucleotides
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is
tester specific and can be converted into cDNA for cloning and other downstream applications, as
described by Hara et al. (\9<H). 
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control cDNA. In order to further enrich those species differentially expressed in 

the tester cDNA, the subtracted tester population is amplified by PCR following 
'AAAA 

• every second subtraction cycle. After six cycles of subtraction (three reamphncation 

steps) the reaction mix is Hgated into a vector for further analysis. 
In a slightly different approach, Hara et al. (1991) utilized a method whereby
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA
extracted from the control population. Following 1st strand cDNA synthesis, the
RNA strand of the heteroduplexes is removed by heat denaturation and centrifugation
(the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed).
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control
(driver) cDNA (which is present in 20-fold excess). After several rounds of
hybridization the only mRNA molecules left in the tester mRNA population are
those which are not found in the driver cDNA-oligotex-dT30 population. These
tester-specific mRNA species are then converted to cDNA and, following the
addition of adaptor sequences, amplified by PCR. The PCR products are then
ligated into a vector for further analysis using restriction sites incorporated into the
PCR primers. A schematic illustration of this subtraction process is shown in figure 2.

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 
Chemical Cross-Linking Subtraction (CCLS)

In this technique, originally described by Hampson et al. (1992), driver mRNA
is mixed with tester cDNA (1st strand only) in a ratio of > 20:1. The common
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single
stranded cDNA. Instead of physically separating these hybrids, they are inactivated
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are
then synthesized from the remaining single stranded cDNA species (unreacted
mRNA species remaining from the driver are not converted into probe material due
to specificity of Sequenase T7 DNA polymerase used to make the probe) and used
to screen a cDNA library made from the tester cell population. A schematic diagram
of the system is shown in figure 3.

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the 
technique should allow isolation of cDNAs derived from transcripts that are present 
at less than 50 copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than other 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting material 
required (at least 10 µg RNA). Consequently, the technique has recently been
refined so that a renewable source of RNA can be generated. The degenerate random



J. C. Rockett et al.



[Figure 3 flow diagram. Control (driver) mRNA and test (tester) mRNA; 1st strand cDNA
synthesis, followed by alkaline hydrolysis; mix and anneal to give mRNA:cDNA hybrids and
unique cDNA species; cross linking agent (DZQ) added; hybrids are cross-linked; probes
synthesised from single stranded cDNA species and used to probe cDNA library.]

Figure 3. Chemical cross-linking subtraction. Excess driver mRNA is mixed with 1st strand tester
cDNA. The common sequences form mRNA:cDNA hybrids which are cross-linked with
2,5-diaziridinyl-1,4-benzoquinone (DZQ) and the remaining cDNA sequences are differentially
expressed in the tester population. Probes are made from these sequences using Sequenase 2.0
DNA polymerase, which lacks reverse transcriptase activity and, therefore, does not react with the
remaining mRNA molecules from the driver. The labelled probes are then used to screen a cDNA
library for clones of differentially expressed sequences. Adapted from Walter et al. (1996), with
permission.



Table 1. The abundance of mRNA species and classes in a typical mammalian cell.

mRNA class     Copies of each   No. of mRNA        Mean % of each     Mean mass (ng) of each
               species/cell     species in class   species in class   species/µg total RNA

Abundant       12000            4                  3.3                1.65
Intermediate   300              500                0.08               0.04
Rare           15               11000              0.004              0.002

Modified from Bertioli et al. (1995).
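
The last two columns of table 1 can be recovered from the first two by simple arithmetic. The sketch below is an illustration only: it assumes that all transcripts have the same average length (so that mass is proportional to copy number) and that mRNA makes up about 5% of total RNA, i.e. 50 ng of mRNA per µg of total RNA. Neither assumption is stated in the table, and the variable names are invented for the example.

```python
# Table 1 values: (class, copies of each species per cell, number of species in class)
classes = [
    ("Abundant", 12000, 4),
    ("Intermediate", 300, 500),
    ("Rare", 15, 11000),
]

# Assumption (not stated in the table): mRNA is about 5% of total RNA,
# i.e. 50 ng of mRNA per microgram of total RNA.
MRNA_NG_PER_UG_TOTAL_RNA = 50.0

# Total number of mRNA molecules per cell across all three classes.
total_copies = sum(copies * n_species for _, copies, n_species in classes)


def per_species_share(copies):
    """Fraction of all mRNA molecules contributed by a single species."""
    return copies / total_copies


for name, copies, _ in classes:
    share = per_species_share(copies)
    percent = 100 * share
    mass_ng = share * MRNA_NG_PER_UG_TOTAL_RNA
    print(f"{name:<13} {percent:.4f}% of mRNA, {mass_ng:.4f} ng per ug total RNA")
```

Under these assumptions the calculation gives values that round to the published figures for all three classes, which suggests that the tabulated percentages are per-species shares of the whole mRNA pool rather than shares within each class.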
at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA
population which is representative of the expressed gene pool and can be used to
synthesize sense RNA for use as driver material. Furthermore, if the final pool of
random cDNA fragments is reamplified using biotinylated T7 primer and random
hexamer, the product can be captured with streptavidin beads and the antisense
strand eluted for use as tester. Since both target and driver can be generated from
the same DROP product, subtraction can be performed in both directions (i.e. for
up- and down-regulated species) between two different DROP products.

Representational Difference Analysis (RDA) 

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the
ability to be amplified by PCR. The procedure is shown schematically in figure 4. 

In essence, the driver and tester mRNA populations are first converted to cDNA 
and amplified by PCR following the ligation of an adaptor. The adaptors are then 
removed from both populations and a new (different) adaptor ligated to the 
amplified tester population only. Driver and tester populations are next melted and 
hybridized together in a ratio of 100:1. Following hybridization, only tester:tester
homohybrids have 5' adaptors at each end of the DNA duplex and can, thus, be filled
in at both 3' ends. Hence, only these molecules are amplified exponentially during
the subsequent PCR step. Although tester:driver heterohybrids are present, they
only amplify in a linear fashion, since the strand derived from the driver has no
adaptor to which the primer can bind. Driver:driver homohybrids have no
adaptors and, therefore, are not amplified. Single stranded molecules are digested
with mung bean nuclease before a further PCR-enrichment of the tester:tester
homohybrids. The adaptors on the amplified tester population are then replaced and
the whole process repeated a further two or three times using an increasing excess of
driver (Hubank and Schatz used tester:driver ratios of 1:400, 1:80000 and
1:800000 for the second, third and fourth hybridizations, respectively). Different
adaptors are ligated to the tester between successive rounds of hybridization and
amplification to prevent the accumulation of PCR products that might interfere with
subsequent amplifications. The final display is a series of differentially expressed
gene products easily observable on an ethidium bromide gel.
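
The difference between exponential and linear amplification described above is what drives the enrichment. The following sketch is an idealised illustration, assuming perfect doubling in every cycle and ignoring plateau effects; the function name and cycle numbers are invented for the example and do not come from the RDA protocol itself.

```python
def rda_copies(cycles):
    """Idealised copies per starting duplex after a number of PCR cycles."""
    return {
        # Adaptor-derived primer sites at both ends: the product doubles every cycle.
        "tester:tester": 2 ** cycles,
        # Primer site on the tester strand only: one extra copy is made per cycle.
        "tester:driver": 1 + cycles,
        # No primer sites at all: the duplex is never copied.
        "driver:driver": 1,
    }


for n in (10, 20):
    copies = rda_copies(n)
    advantage = copies["tester:tester"] / copies["tester:driver"]
    print(f"{n} cycles: {copies}, homohybrid advantage about {advantage:.0f}-fold")
```

Even with this crude model, the tester:tester products outnumber the linearly amplified heterohybrids by several orders of magnitude after 20 cycles, which is why a large excess of driver in the hybridization can be tolerated.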

The main advantages of RDA are that it offers a reproducible and sensitive
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994)
reported that they were able to isolate genes that were differentially expressed in
substantially less than 1% of the cells from which the tester is derived. Perhaps the
main drawback is that multiple rounds of ligation, hybridization, amplification and
digestion are required. The procedure is, therefore, lengthier than many other
differential display approaches and provides more opportunity for operator-induced
error to occur. Although the generation of false positives has been noted, this has
been solved to some degree by O'Neill and Sinclair (1997) through the use of HPLC-
Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are
digested with a 4-cutter restriction enzyme such as DpnII. The 1st set of 12/24 adaptor strands
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is
subsequently melted away and the 3' ends filled in using Taq DNA polymerase. Each cDNA
population is then amplified using PCR, following which the 1st set of adaptors is removed with
DpnII. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA
population, after which the tester is hybridized against a large excess of driver. The 12mer
adaptors are melted and the 3' ends filled in as before. PCR is carried out with primers identical
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially
amplified are those which are tester:tester combinations. Following PCR, ssDNA products are
removed with mung bean nuclease, leaving the 'first difference product'. This is digested and a
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization
stage. The process is repeated to the 3rd or 4th difference product, as described by Lisitsyn et al.
(1993) and Hubank and Schatz (1994).
Suppression PCR Subtractive Hybridization (SSH)

The most recent adaptation of the SH approach to differential expression
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996).
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to
isolating mRNAs present at only a few copies per cell) can be obtained without the
need for multiple hybridizations/subtractions. Instead of physical or chemical
removal of the common sequences, a PCR-based suppression system is used (see
figure 5).

In SSH, excess driver cDNA is added to two portions of the tester cDNA which
have been ligated with different adaptors. A first round of hybridization serves to
enrich differentially expressed genes and equalize rare and abundant messages.
Equalization occurs since reannealing is more rapid for abundant molecules than for
rarer molecules due to the second order kinetics of hybridization (James and Higgins
1985). The two primary hybridization mixes are then mixed together in the presence
of excess driver and allowed to hybridize further. This step permits the annealing of
single stranded complementary sequences which did not hybridize in the primary
hybridization, and in doing so generates templates for PCR amplification. Although
there are several possible combinations of the single stranded molecules present in
the secondary hybridization mix, only one particular combination (differentially
expressed in the tester cDNA, composed of complementary strands having different
adaptors) can amplify exponentially.
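
The equalization argument can be illustrated numerically. The sketch below assumes ideal second-order self-reannealing, dC/dt = -k*C^2, whose solution is C(t) = C0 / (1 + k*C0*t). The rate constant, hybridization time and starting concentrations are arbitrary illustrative values, and hybridization to the driver is deliberately ignored, so this is a simplified picture rather than a model of the full SSH reaction.

```python
def single_stranded_remaining(c0, k, t):
    """Single-stranded concentration left after time t under dC/dt = -k * C**2."""
    return c0 / (1.0 + k * c0 * t)


K = 1.0  # hypothetical rate constant (arbitrary units)
T = 1.0  # hypothetical hybridization time (arbitrary units)

# Hypothetical starting concentrations: abundant message in 1000-fold excess.
abundant_start, rare_start = 1000.0, 1.0

abundant_left = single_stranded_remaining(abundant_start, K, T)
rare_left = single_stranded_remaining(rare_start, K, T)

print(f"ratio before: {abundant_start / rare_start:.0f}:1")
print(f"ratio after:  {abundant_left / rare_left:.2f}:1")
```

With these values the single-stranded fraction of the abundant species collapses while half of the rare species remains, so a 1000:1 imbalance narrows to roughly 2:1. Real hybridizations will deviate from this, which is consistent with the imperfect equalization noted in practice.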

Having obtained the final differential display, two options are available if cloning 
of cDNAs is desired. One is to transform the whole of the final PCR reaction into 
competent cells. Transformed colonies can then be isolated and their inserts 
characterized by sequencing, restriction analysis or PCR. Alternatively, the final 
PCR products can be resolved on a gel and the individual bands excised, reamplified 
and cloned. The first approach is technically simpler and less time consuming. 
However, ligation / transformation reactions are known to be biased towards the 
cloning of smaller molecules, and so the final population of clones will probably not 
contain a representative selection of the larger products. In addition, although 
equalization theoretically occurs, observations in this laboratory suggest that this is 
by no means perfectly accomplished. Consequently, some gene species are present 
in a higher number than others and this will be represented in the final population 
of clones. Thus, in order to obtain a substantial proportion of those gene species that 
actually demonstrate differential expression in the tester population, the number of
clones that will have to be screened after this step may be substantial. The second
approach is initially more time consuming and technically demanding. However, it 
would appear to offer better prospects for cloning larger and low abundance gel 
products. In addition, one can incorporate a screening step that differentiates
products of different sequence but of the same size (HA-staining, see
later). In this way, a good idea of the final number of clones to be isolated and 
identified can be achieved. 

An alternative (or even complementary) approach is to use the final differential 
display reaction to screen a cDNA library to isolate full length clones for further 
characterization, or a DNA array (see later) to quickly identify known genes. SSH 
has been used in this laboratory to begin characterization of the short-term gene 
Figure 5. PCR-Select cDNA subtraction. In the primary hybridization, an excess of driver cDNA is
added to each tester cDNA population. The samples are heat denatured and allowed to hybridize
for between 3 and 8 h. This serves two purposes: (1) to equalize rare and abundant molecules; and
(2) to enrich for differentially expressed sequences, since cDNAs that are not differentially expressed
form type c molecules with the driver. In the secondary hybridization, the two primary
hybridizations are mixed together without denaturing. Fresh denatured driver can also be added
at this point to allow further enrichment of differentially expressed sequences. Type e molecules
are formed in this secondary hybridization which are subsequently amplified using two rounds of
PCR. The final products can be visualized on an agarose gel, labelled directly or cloned into a
vector for downstream manipulation. As described by Diatchenko et al. (1996) and Gurskaya
et al. (1996), with permission.
which are differentially expressed in rat liver following short term exposure to the enzyme
inducers, phenobarbital and Wy-14,643.

of expressed genes which are unique to each compound and time/ dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene-expression profiles they elicit with those 
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the 
individual bands, sequencing and gene data base interrogation reveals many genes 
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3)
Figure 7. SSH display patterns obtained from rat liver following 3-day treatment with Wy-14,643 or
phenobarbital. mRNA extracted from control and treated livers was used to generate the
differential displays using the PCR-Select cDNA subtraction kit (Clontech). Lane: 1, 1 kb
ladder; 2, genes upregulated following Wy-14,643 treatment; 3, genes downregulated following
Wy-14,643 treatment; 4, genes upregulated following phenobarbital treatment; 5, genes
downregulated following phenobarbital treatment; 6, 1 kb ladder. Reproduced from Rockett et
al. (1997), with permission.

exposure, and an almost complete complement of genes is obtained. For example,
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy-14,643 up-regulates
at least 28 genes and down-regulates at least 15 in the rat (a sensitive
species) and produces 48 up- and 37 down-regulated genes in the guinea pig, a
resistant species (Rockett, Swales, Esda and Gibson, unpublished observations).
One of these genes, CD81, was up-regulated in the rat and down-regulated in the
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is
a widely expressed cell surface protein which is involved in a large number of cellular
processes including adhesion, activation, proliferation and differentiation (Levy et
al. 1998). Since all of these functions are altered to some extent in the phenomena
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and
probably mechanistically relevant, that CD81 expression is differentially regulated
in a resistant and susceptible species. However, the down-side of this approach is
that the majority of genes can be sequenced and matched to database sequences, but
the latter are predominantly expressed sequence tags or genes of completely
unknown function, thus partially obscuring a realistic overall assessment of the
critical genes of genuine biological interest. Notwithstanding the lack of complete
functional identification of altered gene expression, such gene profiling studies
essentially provide a 'molecular fingerprint' in response to xenobiotic challenge,
thereby serving as a mechanistically relevant platform for further detailed
investigations.

Differential Display (DD) 

Originally described as 'RNA fingerprinting by arbitrarily primed PCR' (Liang
and Pardee 1992), this method is now more commonly referred to as 'differential
Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital.

Band number            Highest sequence    FASTA-EMBL gene identification
(approximate size      similarity
in bp)

5 (1300)               93.5%               CYP2B1
7 (1000)               95.1%               Preproalbumin
                                           Serum albumin mRNA
8 (950)                98.3%               NCI-CGAP-Pr1 H. sapiens (EST)
10 (850)               95.7%               CYP2B1
11 (800)               Clone 1  94.9%      CYP2B1
                       Clone 2  75.3%      CYP2B2
12 (750)               93.3%               TRPM-2 mRNA
                                           Sulfated glycoprotein
15 (600)               92.9%               Preproalbumin
                                           Serum albumin mRNA
16 (55)                Clone 1  95.2%      CYP2B1
                       Clone 2  93.6%      Haptoglobulin mRNA partial alpha
21 (350)               99.3%               18S, 5.8S & 28S rRNA

Bands 1-4, 6, 9, 13, 14 and 17-20 were shown to be false positives by dot blot analysis and, therefore,
were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes do not
represent the complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but
simply represent the genes sequenced and identified to date.



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number            Clone      Highest sequence    FASTA-EMBL gene identification
(approximate size                 similarity
in bp)

1 (1500)                          95.3%               3-oxoacyl-CoA thiolase
2 (1200)                          92.3%               Hemopexin mRNA
3 (1000)                          91.7%               Alpha-2u-globulin mRNA
7 (700)                Clone 1    77.2%               M. musculus C1 inhibitor
                       Clone 2    94.5%               Electron transfer flavoprotein
                       Clone 3    91.0%               M. musculus Topoisomerase 1 (Topo 1)
8 (650)                Clone 1    86.9%               Soares 2NbMT M. musculus (EST)
                       Clone 2    96.2%               Alpha-2u-globulin (s-type) mRNA
9 (600)                Clone 1    86.9%               Soares mouse NML M. musculus (EST)
                       Clone 2    82.0%               Soares p3NMF 19.5 M. musculus (EST)
10 (550)                          73.8%               Soares mouse NML M. musculus (EST)
11 (525)                          95.7%               NCI-CGAP-Pr1 H. sapiens (EST)
12 (375)                          100.0%              Ribosomal protein
13 (23)                Clone 1    97.2%               Soares mouse embryo NbME13.5 (EST)
                       Clone 2    100.0%              Fibrinogen B-beta-chain
                       Clone 3    100.0%              Apolipoprotein E gene
14 (170)                          % 0%                Soares p3NMF19.5 M. musculus (EST)
15 (140)                          9" 3%               Stratagene mouse testis (EST)
Others: (300)                     96.7%               R. norvegicus RASP 1 mRNA
        (275)                     93.1%               Soares mouse mammary gland (EST)

EST, expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and,
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital,
but simply represent the genes sequenced and identified to date.



display' (DD). In this method, all the mRNA species in the control and treated cell
display compared to the other, are differentially expressed and may be recovered for
further characterization. One advantage of this system is the speed with which it can
be carried out: 2 days to obtain a display and as little as a week to make and identify
clones.

Two commonly used variations are based on different methods of priming the
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor'
at the 3'-end, e.g. 5' (dT11)CA 3' (Liang and Pardee 1992). Alternatively, an
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992).
This variant of RNA fingerprinting has also been called 'RAP' (RNA Arbitrarily
Primed)-PCR. One advantage of this second approach is that PCR products may be
derived from anywhere in the RNA, including open reading frames. In addition, it
can be used for mRNAs that are not polyadenylated, such as many bacterial mRNAs
(Wong and McClelland 1994). In both cases, following reverse transcription and
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer
(arbitrary primers have a single base at each position, as compared to random
primers, which contain a mixture of all four bases at each position). The resulting
PCR, thus, produces a series of products which, depending on the system (primer
length and composition, polymerase and gel system), usually includes 50-100
products per primer set (Band and Sager 1989). When a combination of different
dT-anchors and arbitrary primers is used, almost all mRNA species from a cell can
be amplified. When the cDNA products from two different populations are analysed
side by side on a polyacrylamide gel, differences in expression can be identified and
the appropriate bands recovered for cloning and further analysis.
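
The set of anchored primers needed to cover the polyadenylated mRNA population can be enumerated directly. The sketch below is illustrative only: it assumes an 11-residue dT tract and follows the figure 8 legend in restricting both anchor positions to G, C or A. The function name and defaults are invented for the example, and other anchoring schemes exist.

```python
from itertools import product


def anchored_primers(t_length=11, anchor_bases="GCA", anchor_length=2):
    """List oligo(dT) primers carrying every possible anchor at the 3' end."""
    return [
        "T" * t_length + "".join(anchor)
        for anchor in product(anchor_bases, repeat=anchor_length)
    ]


primers = anchored_primers()
print(f"{len(primers)} anchored primers")
for primer in primers:
    print(f"5'-{primer}-3'")
```

Each anchored primer selects only the subset of mRNAs whose sequence immediately upstream of the polyA tail is complementary to its anchor, so pairing each one with a panel of arbitrary primers divides the mRNA population into manageable displays.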

Although DD is perhaps the most popular approach used today for identifying 
differentially expressed genes, it does suffer from several perceived disadvantages : 

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al.
1995), although this has been disputed (Wan et al. 1996) and the isolation of very
low abundance genes may be achieved in certain circumstances (Guimeraes et
al. 1995a).

(2) The cDNAs obtained often only represent the extreme 3' end of the mRNA
(often the 3'-untranslated region), although this may not always be the case
(Guimeraes et al. 1995a). Since the 3' end is often not included in Genbank and
shows variation between organisms, cDNAs identified by DD cannot always be
matched with their genes, even if they have been identified.

(3) The pattern of differential expression seen on the display often cannot be
reproduced on Northern blots, with false positives arising in up to 70% of cases
(Sun et al. 1994). Some adaptations have been shown to reduce false positives,
including the use of two reverse transcriptases (Sung and Denman 1997),
comparison of uninduced and induced cells over a time course (Burn et al. 1994)
and comparison of DDPCR-products from two uninduced and two induced
lines (Sompayrac et al. 1995). The latter authors also reported that the use of
cytoplasmic RNA rather than total RNA reduces false positives arising from
nuclear RNA that is not transported to the cytoplasm.

Further details of the background, strengths and weaknesses of the DD
technique can be obtained from a review by McClelland et al. (1996) and from
articles by Liang et al. (1995) and Wan et al. (1996).
[Figure 8 diagram. 1st strand cDNA is primed on polyadenylated mRNA either with an anchored
(dT11)CA primer or with an arbitrary primer; the products are denatured and the 2nd strand is
synthesised with any arbitrary primer; the cDNA can now be amplified by PCR using the original
primer pair.]

Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out
either with a polydT11NN primer (where N = G, C or A) or with an arbitrary primer. The use of
different combinations of G, C and A to anchor the first strand polydT primer enables the priming
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more
places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur at none, one
or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary
primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA
in a number of different places, several different 2nd strand products may be obtained from one
binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of primers
is used to amplify the second strand products, with the result that numerous gene sequences are
amplified.



Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE)

A more recent development in the field of differential display is SAGE analysis
(Velculescu et al. 1995). This method uses a different approach to those discussed so
far and is based on two principles. Firstly, in more than 95% of cases, short
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient
information to identify their gene of origin. Secondly, concatenation (linking
together in a series) of these tags allows sequencing of multiple cDNAs within a
single clone. Figure 9 shows a schematic representation of the SAGE process. In this
split into two and different adaptors ligated to the 5' ends of each group. Incorporated
into the adaptors is a recognition sequence for a type IIS restriction enzyme, one
which cuts DNA at a defined distance (< 20 bp) from its recognition sequence.
Hence, following digestion of each captured cDNA population with the IIS enzyme,
the adaptors plus a short piece of the captured cDNA are released. The two
populations are then ligated and the products amplified. The amplified products are
cleaved with the original anchoring enzyme, religated (concatemers are formed in
the process) and cloned. The advantage of this system is that hundreds of gene tags
can be identified by sequencing only a few clones. Furthermore, the number of times
a given transcript is identified is a quantitative measurement of that gene's
abundance in the original population, a feature which facilitates identification of
differentially expressed genes in different cell populations.
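
The counting principle behind SAGE can be sketched in a few lines. The example below is a simplified in silico illustration, not the laboratory protocol: it assumes the anchoring enzyme recognises CATG (the enzyme is not specified in the text above), takes a 9-base tag immediately 3' of the 3'-most anchoring site, and uses made-up toy sequences. All names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site; not specified in the text
TAG_LENGTH = 9        # tag length in bases, as in "nine or 10 base pairs"


def extract_tag(cdna, anchor_site=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag following the 3'-most anchoring site, or None if absent."""
    pos = cdna.rfind(anchor_site)
    if pos == -1:
        return None
    start = pos + len(anchor_site)
    tag = cdna[start:start + tag_length]
    # Discard transcripts too short to yield a full-length tag.
    return tag if len(tag) == tag_length else None


def count_tags(cdnas):
    """Count how often each tag occurs across a collection of cDNA sequences."""
    tags = (extract_tag(cdna) for cdna in cdnas)
    return Counter(tag for tag in tags if tag is not None)


# Toy sequences invented for illustration; two copies of the first transcript.
toy_cdnas = [
    "GGCATGAAACCCGGGTTTAAAA",
    "GGCATGAAACCCGGGTTTAAAA",
    "TTCATGACGTACGTACGAAAA",
    "CATGCC",
]

for tag, n in count_tags(toy_cdnas).most_common():
    print(tag, n)
print("possible 9-base tags:", 4 ** TAG_LENGTH)
```

The tag count stands in for transcript abundance, and the number of possible 9-base tags (4 to the power 9) indicates why such short sequences can usually discriminate between genes.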

Some disadvantages of SAGE analysis are that the method is technically difficult,
requires a large amount of accurate sequencing, is biased towards abundant
mRNAs, has not been validated in the pharmaco/toxicogenomic setting and has
only been used to examine well known tissue differences to date.

Gene Expression Fingerprinting (GEF)

A different capture/restriction digest approach for isolating differentially
expressed genes has been described by Ivanova and Belyavsky (1995). In this
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The
cDNA population is then digested with a specific endonuclease and captured with
magnetic streptavidin microbeads to facilitate removal of the unwanted 5' digestion
products. The use of restricted 3'-ends alone serves to reduce the complexity of the
cDNA fragment pool and helps to ensure that each RNA species is represented by
not more than one restriction product. An adaptor is ligated to facilitate subsequent
amplification of the captured population. PCR is carried out with one adaptor-specific
and one biotinylated polydT primer. The reamplified population is
recaptured and the non-biotinylated strands removed by alkaline dissociation. The
non-biotinylated strand is then resynthesized using a different adaptor-specific
primer in the presence of a radiolabelled dNTP. The labelled immobilized 3' cDNA
ends are next sequentially treated with a series of different restriction endonucleases
and the products from each digestion analysed by PAGE. The result is a fingerprint
composed of a number of ladders (equal to the number of sequential digests used).
By comparing test versus control fingerprints, it is possible to identify differentially
expressed products which can then be isolated from the gel and cloned. The
advantages of this procedure are that it is very robust and reproducible, and the
authors estimate that 80-93% of cDNA molecules are involved in the final
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more
than 300-400 bands, which compares poorly to the 1000 or more which are
estimated to be produced in an average experiment. The use of 2-D gels such as
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to
overcome this problem.

A similar method for displaying restriction endonuclease fragments was later
described by Prashar and Weissman (1996). However, instead of sequential
digestion of the immobilized 3'-terminal cDNA fragments, these authors simply
compared the profiles of the control and treated populations without further
manipulation.
Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme
(AE) and the 3' ends captured using streptavidin beads. The cDNA pool is divided in half and each
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme,
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with
the AE and the ditags isolated from the linkers using PAGE.
DNA arrays

'Open' differential display systems are cumbersome in that it takes a great deal 
of time to extract and identify candidate genes and then confirm that they are indeed 
up- or down-regulated in the treated compared to the control tissue. Normally, the 
latter process is carried out using Northern blotting or RT-PCR. Even so, each of 
the aforementioned steps produce a bottleneck to the ultimate goal of rapid analvsis 
of gene expression. These problems will likely be addressed by the development of 
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1995, Schena et al. 1996), 
the introduction of which has signalled the next era in differential gene expression 
analysis. DNA arrays consist of a ! gndded membrane or glass chips' containing 
hundreds or thousands of DNA spots, each consisting of multiple copies of part of 
a known gene. The genes are often selected based on previously proven involvement 
in oncogenesis, cell cycling, DNA repair, development and other cellular processes. 
They are usually chosen to be as specific as possible for each gene and animal species. 
Human and mouse arrays are already commercially available and a few companies 
will construct a personalized array to order, for example Clontech Laboratories and 
Research Genetics Inc. The technique is rapid in that hundreds or even thousands 
of genes can be spotted on a single array, and that mRNA/cDNA from the test 
populations can be labelled and used directly as probe. When analysed with 
appropriate hardware and software, arrays offer a rapid and quantitative means to 
assess differences in gene expression between two cell populations. Of course, there 
can only be identification and quantitation of those genes which are in the array 
(hence the term 'closed' system). Therefore, one approach to elucidating the 
molecular mechanisms involved in a particular disease/development system may be 
to combine an open and closed system: a DNA array to directly identify and 
quantitate the expression of known genes in mRNA populations, and an open 
system such as SSH to isolate unknown genes which are differentially expressed. 
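
As a rough illustration of how a closed system reports expression differences, the following minimal sketch compares spot intensities from a control and a treated hybridization and lists the genes whose ratio passes a fold-change threshold. The gene names, the intensity values, the total-intensity normalization and the 2-fold cutoff are all illustrative assumptions, not data or methods taken from the studies cited above.

```python
from typing import Dict, List, Tuple


def normalize(intensities: Dict[str, float]) -> Dict[str, float]:
    """Scale spot intensities so they sum to 1 (total-intensity normalization)."""
    total = sum(intensities.values())
    return {gene: value / total for gene, value in intensities.items()}


def changed_genes(
    control: Dict[str, float],
    treated: Dict[str, float],
    min_fold: float = 2.0,  # assumed cutoff, chosen for illustration only
) -> List[Tuple[str, float]]:
    """Return (gene, treated/control ratio) for spots changed by at least min_fold."""
    c, t = normalize(control), normalize(treated)
    hits = []
    # Only genes spotted on the array can be scored: the 'closed' system limit.
    for gene in sorted(c.keys() & t.keys()):
        if c[gene] <= 0:
            continue  # no control signal, ratio undefined
        ratio = t[gene] / c[gene]
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            hits.append((gene, ratio))
    return hits


if __name__ == "__main__":
    # Hypothetical intensities for four spotted genes.
    control = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0, "geneD": 100.0}
    treated = {"geneA": 400.0, "geneB": 100.0, "geneC": 25.0, "geneD": 100.0}
    for gene, ratio in changed_genes(control, treated):
        print(f"{gene}: {ratio:.2f}-fold")
```

Any gene absent from the array is silently invisible to such an analysis, which is why pairing it with an open system such as SSH is suggested above.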

One of the main advantages of DNA arrays is the huge number of gene fragments 
which can be put on a membrane; some companies have reported gridding up to 
60000 spots on a single glass 'chip' (microscope slide). These high density chip- 
based micro-arrays will probably become available as mass-produced off-the-shelf 
items in the near future. This should facilitate the more rapid determination of 
differential expression in time and dose-response experiments. Aside from their 
high cost and the technical complexities involved in producing and probing DNA 
arrays, the main problem which remains, especially with the newer micro-array 
(gene-chip) technologies, is that results are often not wholly reproducible between 
arrays. However, this problem is being addressed and should be resolved within the 
next few years. 



EST databases as a means to identify differentially expressed genes 

Expressed sequence tags (ESTs) are partial sequences of clones obtained from 
cDNA libraries. Even though most ESTs have no formal identity (putative 
identification is the best to be hoped for), they have proven to be a rapid and efficient 
means of discovering new genes and can be used to generate profiles of gene 
expression in specific cells. Since they were first described by Adams et al. (1991), 
there has been a huge explosion in EST production and it is estimated that there are 
now well over a million such sequences in the public domain, representing over half 
of all human genes (Hillier et al. 1996). This large number of freely available 
sequences (both sequence information and clones are normally available royalty-free 
from the originators) has enabled the development of a new approach towards 
differential gene expression analysis as described by Vasmatzis et al. (1998). The 
approach is simple in theory: EST databases are first searched for genes that have a 
number of related EST sequences from the target tissue of choice, but none or few 
from non-target tissue libraries. Programmes to assist in the assembly of such sets of 
overlapping data may be developed in-house or obtained privately or from the 
internet. For example, the Institute for Genomic Research (TIGR, found at 
http://www.tigr.org) provides many software tools free of charge to the scientific 
community. Included amongst these is the TIGR assembler (Sutton et al. 1995), a 
tool for the assembly of large sets of overlapping data such as ESTs, bacterial 
artificial chromosomes (BACs), or small genomes. Candidate EST clones representing 
different genes are then analysed using RNA blot methods for size and tissue 
specificity and, if required, used as probes to isolate and identify the full length 
cDNA clone for further characterization. In practice, however, the method is rather 
more involved, requiring bioinformatic and computer analysis coupled with 
confirmatory molecular studies. Vasmatzis et al. (1998) have described several 
problems in this fledgling approach, such as separating highly homologous 
sequences derived from different genes and an overemphasis of specificity for some 
EST sequences. However, since these problems will largely be addressed by the 
development of more suitable computer algorithms and an increased completeness 
of the EST database, it is likely that this approach to identifying differentially- 
expressed genes may enjoy more patronage in the future. 
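
The database search step described above can be sketched as a simple counting exercise. The following is a minimal illustration, assuming the ESTs have already been assembled into clusters and annotated with their library tissue of origin; the cluster identifiers, tissue labels and the thresholds `min_target` and `max_other` are hypothetical and are not taken from the published method.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tissue_enriched_clusters(
    ests: Iterable[Tuple[str, str]],
    target_tissue: str,
    min_target: int = 3,   # assumed: minimum ESTs from the target tissue
    max_other: int = 1,    # assumed: maximum ESTs tolerated from other tissues
) -> List[str]:
    """Return clusters with many target-tissue ESTs and few from elsewhere.

    ests: (cluster_id, tissue_of_origin) pairs, one pair per EST.
    """
    target_counts: Counter = Counter()
    other_counts: Counter = Counter()
    for cluster_id, tissue in ests:
        if tissue == target_tissue:
            target_counts[cluster_id] += 1
        else:
            other_counts[cluster_id] += 1
    return sorted(
        cluster
        for cluster, n in target_counts.items()
        if n >= min_target and other_counts[cluster] <= max_other
    )


if __name__ == "__main__":
    # Hypothetical EST annotations.
    ests = (
        [("cluster1", "prostate")] * 4 + [("cluster1", "liver")]
        + [("cluster2", "prostate")] * 5 + [("cluster2", "liver")] * 3
        + [("cluster3", "prostate")] * 2
    )
    print(tissue_enriched_clusters(ests, "prostate"))
```

Candidates returned by such a filter would still need the confirmatory RNA blot and cloning steps described above, since highly homologous sequences from different genes can end up in the same cluster.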



Problems and potential of differential expression techniques 

The holistic or single cell approach ? 

When working with in vivo models of differential expression, one of the first 
issues to consider must be the presence of multiple cell types in any given specimen. 
For example, a liver sample is likely to contain not only hepatocytes, but also 
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. 
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will 
each have their own distinctive cell populations. Also, in the case of neoplastic tissue, 
there are almost always normal, hyperplastic and/or dysplastic cells present in a 
sample. One must, therefore, be aware that genes obtained from a differential 
display experiment performed on an animal tissue model may not necessarily arise 
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If 
appropriate, further analyses using immunohistochemistry, in situ hybridization or 
in situ RT-PCR should be used to confirm which cell types are expressing the 
gene(s) of interest. This problem is probably most acute for those studying the 
differential expression of genes in the development of different cell types, where 
there is a need to examine homologous cell populations. The problem is now being 
addressed at the National Cancer Institute (Bethesda, MD, USA) where new 
microdissection techniques have been employed to assist in their gene analysis programme, 
e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et 
al. 1998) and magnetic bead technology (Richard et al. 1998, Rogler et al. 1998). 

However, those taking a holistic approach may consider this issue unimportant. 
There is an equally appropriate view that all those genes showing altered expression 
within a compromised tissue should be taken into consideration. After all, since all 
tissues are complex mixes of different, interacting cell types which intimately 
regulate each other's growth and development, it is clear that each cell type could in 
some way contribute (positively or negatively) towards the molecular mechanisms 
which lie behind responses to external stimuli or neoplastic growth. It is perhaps 
then more informative to carry out differential display experiments using in vivo as 
opposed to in vitro models, where uniform populations of identical cells probably 
represent a partial, skewed or even inaccurate picture of the molecular changes that 
occur. 

The incidence and possible implications of inter-individual biological variation 
should be considered in any approach where whole animal models are being used. It 
is clear that individuals (humans and animals) respond in different ways to identical 
stimuli. One of the best characterized examples is the debrisoquine oxidation 
polymorphism, which is mediated by cytochrome CYP2D6 and determines the 
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and 
Zanger 1997). The reasons for such differences are varied and complex, but allelic 
variations, regulatory region polymorphisms and even physical and mental health 
can all contribute to observed differences in individual responses. Careful thought 
should, therefore, be given to the specific objectives of the study and to the possible 
value of pooling starting material (tissue/mRNA). The effect of this can be 
beneficial through the ironing out of exaggerated responses and unimportant minor 
fluctuations of (mechanistically) irrelevant genes in individual animals, thus 
providing a clearer overall picture of the general molecular mechanisms of the 
response. However, at the same time such minor variations may be of utmost 
importance in deciding the ability of individual animals to succumb to or resist the 
effects of a given chemical/disease. 



How efficient are differential expression techniques at recovering a high percentage of 
differentially expressed genes ? 

A number of groups have produced experimental data suggesting that mammalian 
cells produce between 8000 and 15 000 different mRNA species at any one time 
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as 
high as 20 000-30 000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984) 
provided evidence suggesting that the majority of these belong to the rare abundance 
class. A breakdown of this abundance distribution is shown in table 1. 
When the results of differential display experiments have been compared with 
data obtained previously using other methods, it is apparent that not all differentially 
expressed mRNAs are represented in the final display. In particular, rare messages 
(which, importantly, often include regulatory proteins) are not easily recovered 
using differential display systems. This is a major shortcoming, as the majority of 
mRNA species exist at levels of less than 0.005% of the total population (table 1). 
Bertioli et al. (1995) examined the efficiency of DD templates (heterogeneous 
mRNA populations) for recovering rare messages and were unable to detect mRNA 
species present at less than 1.2% of the total mRNA population, equivalent to an 
intermediate or abundant species. Interestingly, when simple model systems (single 
target only) were used instead of a heterogeneous mRNA population, the same 
primers could detect levels of target mRNA down to 10 000-fold smaller. These results 
are probably best explained by competition for substrates from the many PCR 
products produced in a DD reaction. 
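
The size of the gap between these detection limits and the abundance of rare messages is easy to lose in the prose, so a short worked calculation may help. The sketch below uses only the three figures quoted above (0.005%, 1.2% and the 10 000-fold improvement); the function and variable names are illustrative.

```python
# Figures quoted in the text above, as percent of total mRNA.
RARE_CLASS_UPPER_LIMIT = 0.005   # most mRNA species fall below this level
DD_LIMIT_HETEROGENEOUS = 1.2     # lowest level detected in a mixed population
SIMPLE_MODEL_GAIN = 10_000       # fold improvement with a single-target template


def detectable(abundance_percent: float, limit_percent: float) -> bool:
    """True if a message at this abundance is at or above the detection limit."""
    return abundance_percent >= limit_percent


# Detection limit implied for the single-target model system.
simple_model_limit = DD_LIMIT_HETEROGENEOUS / SIMPLE_MODEL_GAIN

# How far the heterogeneous-template limit sits above the rare-class ceiling.
shortfall_fold = DD_LIMIT_HETEROGENEOUS / RARE_CLASS_UPPER_LIMIT

if __name__ == "__main__":
    print(f"single-target limit: {simple_model_limit:.5f}% of total mRNA")
    print(f"mixed-population limit is {shortfall_fold:.0f}-fold above the rare-class ceiling")
    print("rare message detectable in mixed population:",
          detectable(RARE_CLASS_UPPER_LIMIT, DD_LIMIT_HETEROGENEOUS))
```

On these figures a message at the top of the rare class would need to be roughly 240 times more abundant to be seen in a heterogeneous DD reaction, whereas the single-target limit lies well below the rare-class ceiling.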

The numbers of differentially expressed mRNAs reported in the literature using 
various model systems provide further evidence that many differentially expressed 
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array 
technology to examine gene expression in yeast following exhaustion of sugar in the 
medium, and found that more than 1700 genes showed a change in expression of at 
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that 
of the 8000-15 000 different mRNA species produced by any given mammalian cell, 
up to 1000 or more may show altered expression following chemical stimulation. 
Whilst this may be an extreme figure, it is known that at least 100 genes are 
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Oman et al. 
1990). In addition, Wan et al. (1996) estimated that interferon-γ-stimulated HeLa 
cells differentially express up to 433 genes (assuming 24 000 distinct mRNAs 
expressed by the cells). However, there have been few publications documenting 
anywhere near the recovery of these numbers. For example, in using DD to compare 
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38 000 
total bands to be different. Of these, 50% (35 genes) were shown to correspond to 
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in 
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997) 
identified 14 different gene products whose expression was altered by phorbol 
myristate acetate (PMA, a tumour promoter agent) stimulation of a human 
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene 
products whose expression was upregulated in the peripheral blood leukocytes of 
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially 
expressed between young and senescent fibroblasts. Techniques other than DD 
have also provided an apparent paucity of differentially expressed genes. Using SH, 
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal 
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17 
genes upregulated in rat liver following treatment with the peroxisome proliferator, 
clofibrate; Philips et al. (1990) isolated 12 cDNA clones which were upregulated in 
highly metastatic mammary adenocarcinoma cell lines compared to poorly metastatic 
ones. Prashar and Weissman (1996) used 3' restriction fragment analysis and 
identified approximately 40 genes showing altered expression within 4 h of 
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene 
fragments isolated using SSH of delayed early response phase of liver regeneration 
and found only 12 to be upregulated. 

In the laboratory, SSH was used to isolate up to 70 candidate genes which appear 
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson, 
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 

positives), it is still not clear if such adaptations are practically effective: proving 
efficiency by spiking with a known amount of limited numbers of artificial 
construct(s) is one thing, but isolating a high percentage of the rare messages already 
present in an mRNA population is another. Of course, some models will genuinely 
produce only a small number of differentially expressed genes. In addition, there are 
also technical problems that can reduce efficiency. For example, mRNAs may have 
an unusual primary structure that effectively prevents their amplification by 
PCR-based systems. In addition, it is known that under certain circumstances not all 
mRNAs have 3' polyA sites. For example, during Xenopus development, 
deadenylation is used as a means to stabilize RNAs (Voeltz and Steitz 1998), whilst 
preferential deadenylation may play a role in regulating Hsp70 (and perhaps, 
therefore, other stress protein) expression in Drosophila (Dellavalle et al. 1994). The 
presence of deadenylated mRNAs would clearly reduce the efficiency of systems 
utilizing a polydT reverse transcription step. The efficiency of any system also 
depends on the quality of the starting material. All differential display techniques 
use mRNA as their target material. However, it is difficult to isolate mRNA that is 
completely free of ribosomal RNA. Even if polydT primers are used to prime first 
strand cDNA synthesis, ribosomal RNA is often transcribed to some degree 
(Clontech PCR-Select cDNA Subtraction kit user manual). It has been shown, at 
least in the case of SSH, that a high rRNA:mRNA ratio can lead to inefficient 
subtractive hybridization (Clontech PCR-Select cDNA Subtraction kit user 
manual) and there is no reason to suppose that it will not do likewise in other SH 
approaches. Finally, those techniques that utilise a presubtraction amplification step 
(e.g. RDA) may present a skewed representation since some sequences amplify 
better than others. 

Of course, probably the most important consideration is the temporal factor. It 
is clear that any given differential display experiment can only interrogate a cell at 
one point in time. It may well be that a high percentage of the genes showing altered 
expression at that time are obtained. However, given that disease processes and 
responses to environmental stimuli involve dynamic cascades of signalling, 
regulation, production and action, it is clear that all those genes which are switched 
on/off at different times will not be recovered and, therefore, vital information may 
well be missed. It is, therefore, imperative to obtain as much information about the 
model system beforehand as possible, from which a strategy can be derived for 
targeting specific time points or events that are of particular interest to the 
investigator. One way of getting round this problem of single time point analysis is 
to conduct the experiment over a suitable time course which, of course, adds 
substantially to the amount of work involved. 



How sensitive are differential expression technologies ? 

There has been little published data that addresses the issue of how large the 
change in expression must be for it to permit isolation of the gene in question with 
the various differential expression technologies. Although the isolation of genes 
whose expression is changed as little as 1.5-fold has been reported using SSH 
(Groenink and Leegwater 1996), it appears that those which demonstrate a change in 
excess of 5-fold are more likely to be picked up. Thus, there is a 'grey zone' 
in between where small changes could fade in and out of isolation between 
experiments and animals. DD, on the other hand, is not subject to this grey 
zone since, unlike SH approaches, it does not amplify the difference in expression 
between two samples. Wan et al. (1996) reported that differences in expression of 
twofold or more are detectable using DD. 
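
The thresholds quoted in this section can be written down as a small classifier, which makes the 'grey zone' explicit. This is only a sketch: the 1.5-fold, 5-fold and 2-fold figures come from the text, but the category labels, the function names and the decision to treat down-regulation symmetrically (so that 0.2 counts as a 5-fold change) are assumptions made for illustration.

```python
def fold_magnitude(fold_change: float) -> float:
    """Put up- and down-regulation on the same scale (0.2 becomes 5.0)."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    return fold_change if fold_change >= 1.0 else 1.0 / fold_change


def ssh_recovery_class(fold_change: float) -> str:
    """Classify a change against the SSH figures quoted in the text."""
    magnitude = fold_magnitude(fold_change)
    if magnitude > 5.0:
        return "likely recovered"      # in excess of 5-fold
    if magnitude >= 1.5:
        return "grey zone"             # reported, but may fade in and out
    return "below reported range"      # smaller than any reported isolation


def dd_detectable(fold_change: float) -> bool:
    """Twofold or more is reported as detectable using DD (Wan et al. 1996)."""
    return fold_magnitude(fold_change) >= 2.0


if __name__ == "__main__":
    for change in (1.2, 1.5, 3.0, 8.0, 0.1):
        print(change, ssh_recovery_class(change), dd_detectable(change))
```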



Resolution and visualization of differential expression products 

It seems highly improbable with current technology that a gel system could be 
developed that is able to resolve all gene species showing altered expression in any 
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis 
(PAGE) can resolve size differences down to 0.2% (Sambrook et al. 1989) and is 
used as standard in DD experiments. Even so, it is clear that a complex series of gene 
products such as those seen in a DD will contain unresolvable components. Thus, 
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has 
been well documented (Mathieu-Daude et al. 1996, Smith et al. 1997) that a single 
band extracted from a DD often represents a composite of heterogeneous products, 
and the same has been found for SSH displays in this laboratory (Rockett et al. 
1997). One possible solution was offered by Mathieu-Daude et al. (1996), who 
extracted and reamplified candidate bands from a DD display and used single strand 
conformation polymorphism (SSCP) analysis to confirm which components 
represented the truly differentially expressed product. 

Many scientists often try to avoid the use of PAGE where possible because it is 
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately, 
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor 
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate 
than PAGE, can only separate DNA sequences which differ in size by around 
1.5-2% (15-20 base pairs for a 1 kb fragment). Thus, SSH, RDA or other such 
products which differ in size by less than this amount are normally not resolvable. 
However, a simple technique does in fact exist for increasing the resolving power of 
AGE: the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow 
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a 
gel separates identical or closely sized products on base content. Specifically, 
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively 
(Wawer et al. 1995, Hanse Analytik, personal communication). Since both 
HA-stains possess an overall positive charge, they migrate towards the cathode 
when an electric field is applied. This is in direct opposition to DNA, which 
is negatively charged and, therefore, migrates towards the anode. Thus, if two 
DNA clones are identical in size (as perceived on a standard high resolution 
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel 
will effectively retard the migration of one of the sequences compared to the 
other, effectively making it apparently larger and, thus, providing a means of 
differentiating between the two. The use of HA-red has been shown to resolve 
sequences with an AT variation of less than 1% (Wawer et al. 1995), whilst Hanse 
Analytik have reported that HA staining is so sensitive that in one case it was used 
to distinguish two 567 bp sequences which differed by only a single point mutation 
(Hanse Analytik, personal communication). [...] 

Figure 10. Discrimination of clones of identical/nearly identical size using HA-red. Bands of decreasing 
size (1-5) were extracted from the final display of a suppression subtractive hybridization 
experiment and cloned. Seven colonies were picked at random from each cloned band and their 
inserts amplified using PCR. The products were run on two gels: (A) a high resolution 2% agarose 
gel, and (B) a high resolution 2% agarose gel containing 1 U/ml HA-red. With few exceptions, all 
the clones from each band appear to be the same size (gel A). However, the presence of HA-red 
(gel B), which separates identically-sized DNA fragments based on the percentage of GC within 
the sequence, clearly indicates the presence of different gene species within each band. For 
example, even though all five re-amplified clones of band 1 appear to be the same size, at least four 
different gene species are represented. 



in a similar gel containing one of the HA-stains. The standard gel should indicate 
any gross size differences, whilst the HA-stained gel should separate otherwise 
unresolvable species (on standard AGE) according to their base content. Geisinger 
et al. (1997) reported successful use of this approach for identifying DD-derived 
clones. Figure 10 shows such an experiment carried out in this laboratory on clones 
obtained from a band extracted from an SSH display. 
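
The logic of running a standard gel alongside an HA-stained gel can be sketched in a few lines: two clones are first compared on size and, if that fails to separate them, on base composition. In the sketch below the 1.5% size threshold is the agarose figure quoted earlier, while the 1% GC-difference threshold, the function names and the example sequences are assumptions chosen for illustration; real migration behaviour depends on the gel and dye conditions.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


def separation_strategy(
    seq_a: str,
    seq_b: str,
    min_size_diff: float = 0.015,  # 1.5% size difference, as quoted for agarose
    min_gc_diff: float = 1.0,      # assumed: 1 percentage point of GC content
) -> str:
    """Suggest which gel, if either, could distinguish two cloned inserts."""
    longer = max(len(seq_a), len(seq_b))
    size_diff = abs(len(seq_a) - len(seq_b)) / longer
    if size_diff >= min_size_diff:
        return "standard agarose gel"
    if abs(gc_percent(seq_a) - gc_percent(seq_b)) >= min_gc_diff:
        return "HA-stained gel"
    return "unresolved"


if __name__ == "__main__":
    same_size_at_rich = "AT" * 50   # hypothetical 100 bp insert, 0% GC
    same_size_gc_rich = "GC" * 50   # hypothetical 100 bp insert, 100% GC
    print(separation_strategy(same_size_at_rich, same_size_gc_rich))
    print(separation_strategy("A" * 100, "A" * 110))
```

Pairs that come back as unresolved correspond to the case discussed below, where SSCP or gradient gel methods would be needed.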

An alternative approach is to carry out a 2-D analysis of the differential display 
products. In this approach, size-based separation is first carried out in a standard 
agarose gel. The gel slice containing the display is then extracted and incorporated 
into a HA gel for resolution based on AT/GC content. 

Of course, one should always consider the possibility of there being different 
gene species which are the same size and have the same GC/AT content. However, 
even these species are not unresolvable given some effort: again, one might use 
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature 
gradient gel electrophoresis (TGGE) approach to resolve the contents of a band, 
either directly on the extracted band (Suzuki et al. 1991) or on the reamplified 
product. 

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to 
overcoming this might be to use 2-D gels such as those described by Uitterlinden et 
al. (1989) and Hatada et al. (1991). 

Extraction of differentially expressed bands from a gel can be complex since, in 
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means, 
such that precise overlay of the developed film on the gel must occur if the correct 
band is to be extracted for further analysis. Clearly, a misjudged extraction can 
account for many man-hours lost. This problem, and that of the use of radioisotopes, 
has been addressed by several groups. For example, Lohmann et al. (1995) 
demonstrated that silver staining can be used directly to visualize DD bands in 
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a 
small amount (20-30%) of the DNA from their DD to a nylon membrane, and 
visualizing the bands using chemiluminescent staining before going back to extract 
the remaining DNA from the gel. Chen and Peck (1996) went one step further and 
transferred the entire DD to a nylon membrane. The DNA bands were then 
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT 
primers used in the differential display procedure). Differentially expressed bands 
were cut from the membrane and the DNA eluted by washing with PCR buffer prior 
to reamplification. 

One of the advantages of using techniques such as SSH and RDA is that the final 
display can be run on an agarose gel and the bands visualized with simple ethidium 
bromide staining. Whilst this approach can provide acceptable results, overstaining 
with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst 
differential displays stained with SYBR Green I are better visualized using short 
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter 
wavelength is much more DNA damaging. In practice, it takes only a few seconds 
to damage DNA extracted under 254 nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR Green I 
and extract bands under a medium wavelength UV transillumination. 



The possible use of 'microfingerprinting' to reduce complexity 

Given the sheer number of gene products and the possible complexity of each 
band, an alternative approach to rapid characterization may be to use an enhanced 
analysis of a small section of a differential display: a 'sub-fingerprint' or 
'micro-fingerprint'. In this case, one could concentrate on those bands which only appear 
in a particular chosen size region. Reducing the fingerprint in this way has at least 
two advantages. One is that it should be possible to use different gel types, 
concentrations and run times tailored exactly to that region. Currently, one might 
run products from 100-3000 bp on the same gel, which leads to compromise in the 
gel system being used and consequently to suboptimal resolution, both in terms of 
size and numbers, and can lead to problems in the accurate excision of individual 
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis 
using a HA-stain, as described earlier. In summary, if a range of gene product sizes 
is carefully chosen to include certain 'relevant' genes, the 2-D system standardized, 
and appropriate gene analysis used, it may be possible to develop a method for the 
early and rapid identification of compounds which have similar or widely different 
[...] 

An alternative approach to microfingerprinting is to examine altered expression 
in specific families of genes through careful selection of PCR primers and/or 
post-reaction analysis. Stress genes, growth factors and/or their receptors, cell cycling 
genes, cytochromes P450 and regulatory proteins might be considered as candidates 
for analysis in this way. Indeed, some off-the-shelf DNA arrays (e.g. Clontech's 
Atlas cDNA Expression Array series) already anticipate this to some degree by 
grouping together genes involved in different responses, e.g. apoptosis, stress, 
DNA-damage response etc. 



Screening 

False positives 

The generation of false positives has been discussed at length amongst the 
differential display community (Liang et al. 1993, 1995, Nishio et al. 1994, Sun et al. 
1994, Sompayrac et al. 1995). The reason for false positives varies with the 
technique being used. For instance, in RDA, the use of adaptors which have not 
been HPLC purified can lead to the production of false positives through illegitimate 
ligation events (O'Neill and Sinclair 1997), whilst in DD they can arise through 
PCR artifacts and illegitimate transcription of rRNA. In SH, false positives appear 
to be derived largely from abundant gene species, although some may arise from 
cDNA/mRNA species which do not undergo hybridization for technical reasons. 

A quick screening of putative differentially expressed clones can be carried out 
using a simple dot blot approach, in which labelled first strand probes synthesized 
from tester and driver mRNA are hybridized to an array of said clones (Hedrick et 
al. 1984, Sakaguchi et al. 1986). Differentially expressed clones will hybridize to 
tester probe, but not driver. The disadvantage of this approach is that rare species 
may not generate detectable hybridization signals. One option for those using SSH 
is to screen the clones using a labelled probe generated from the subtracted cDNA 
from which it was derived, and with a probe made from the reverse subtraction 
reaction (ClonTechniques 1997a). Since the SSH method enriches rare sequences, 
it should be possible to confirm the presence of clones representing low abundance 
genes. Despite this quick screening step, there is still the need to go back to the 
original mRNA and confirm the altered expression using a more quantitative 
approach. Although this may be achieved using Northern blots, the sensitivity is 
poor by today's high standards and one must rely on PCR methods for accurate and 
sensitive determinations (see below). 



Sequence analysis 

The majority of differential display procedures produce final products which are 
between 100 and 1000 bp in size. However, this may considerably reduce the size of 
the sequence for analysis of the DNA databases. This in turn leads to a reduced 
confidence in the result: several families of genes have members whose DNA 
sequences are almost identical except in a few key stretches, e.g. the cytochrome 
P450 gene superfamily (Nelson et al. 1996). Thus, does the clone identified as being 
almost identical to gene X really come from that gene, or its brother gene X1 or its 
as yet undiscovered sister X2? For example, using SSH, part of a gene was isolated, 
which was up-regulated in the liver of rats exposed to Wy-14,643 and was identified 
by a FASTA search as being transferrin (data not shown). However, transferrin is 
known to be downregulated by hypolipidemic peroxisome proliferators such as 
Wy-14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR 
analysis. This suggests that the gene sequence isolated may belong to a gene which 
is closely related to transferrin, but is regulated by a different mechanism. 

A further problem associated with SH technology is redundancy. In most cases 
before SH is carried out, the cDNA population must first be simplified by restriction 
digestion. This is important for at least two reasons: 

(1) To reduce complexity— long cDNA fragments may form complex networks 
which prevent the formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization. 

(2) Cutting the cDNAs into small fragments provides better representation of 
individual genes. This is because genes derived from related but distinct 
members of gene families often have similar coding sequences that may 
cross-hybridize and be eliminated during the subtraction procedure (Ko 1990). 
Furthermore, different fragments from the same cDNA may differ considerably 
in terms of hybridization and amplification and, thus, may not efficiently do one 
or the other (Wang and Brown 1991). Thus, some fragments from differentially 
expressed cDNAs may be eliminated during subtractive hybridization pro- 
cedures. However, other fragments may be enriched and isolated. As a 
consequence of this, some genes will be cut one or more times, giving rise to two 
or more fragments of different sizes. If those same genes are differentially 
expressed, then two or more of the different size fragments may come through 
as separate bands on the final differential display, increasing the observed 
redundancy and increasing the number of redundant sequencing reactions. 
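
The redundancy described in point (2) can be illustrated with a minimal in silico digestion. The sketch below assumes, purely for illustration, a four-base cutter that recognizes GTAC and cuts between GT and AC (the specificity of RsaI); the choice of enzyme, the function name `digest` and the toy cDNA are not taken from the text.

```python
def digest(seq: str, site: str = "GTAC", cut_offset: int = 2) -> list[str]:
    """Cut seq at every occurrence of site.

    cut_offset is the position inside the site where the cut falls
    (2 for GT^AC). Both defaults are illustrative assumptions.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    # Drop empty fragments (possible if a cut falls at either end of seq)
    return [f for f in fragments if f]


# Toy cDNA (hypothetical) containing two GTAC sites
cdna = "ATGGCCGTACTTTGGAAGTACCCCTGA"
fragments = digest(cdna)
print(len(fragments), [len(f) for f in fragments])
```

A single differentially expressed cDNA with two internal sites therefore yields three fragments, each of which could in principle appear as a separate band and be sequenced separately.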

Sequence comparisons also throw up another important point: at what degree 
of sequence similarity does one accept a result? Is 90% identity between a gene 
derived from your model species and another acceptably close? Is 95% between 
your sequence and one from the same species also acceptable? This problem is 
particularly relevant when the forward and reverse sequence comparisons give 
similar sequences with completely different gene species! An arbitrary decision 
seems to be to allocate genes that are definite (95% and above similarity) and then 
group those between 60 and 95% as being related or possible homologues. 

Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the 
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot 
analysis is a popular approach as it is relatively easy and quick to perform. However, 
the major drawback with Northern blots is that they are often not sensitive enough 
to detect rare sequences. Since the majority of messages expressed in a cell are of low 
abundance (see table 1), this is a major problem. Consequently, RT-PCR may be the 



J. C. Rockett et al. 



appropriate thermal cycling technology. Whilst quantitative analysis is more 
desirable, being more accurate and without reliance on an internal standard, the 
money and time needed to develop a competitor molecule is often excessive, 
especially when one might be examining tens or even hundreds of gene species. The 
use of semi-quantitative analysis is simpler, although still relatively involved. One 
must first of all choose an internal standard that does not change in the test cells 
compared to the controls. Numerous reference genes have been tried in the past, for 
example interferon-gamma (IFN-γ, Frye et al. 1989), β-actin (Heuval et al. 1994), 
glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Wong et al. 1994), 
dihydrofolate reductase (DHFR, Mohler and Butler 1991), β2-microglobulin 
(β-2-m, Murphy et al. 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et 
al. 1998) and a number of others (ClonTechniques 1997b). Ideally, an internal 
standard should not change its level of expression in the cell regardless of cell age, 
stage in the cell cycle or through the effects of external stimuli. However, it has been 
shown on numerous occasions that the levels of most housekeeping genes currently 
used by the research community do in fact change under certain conditions and in 
different tissues (ClonTechniques 1997b). It is imperative, therefore, that 
preliminary experiments be carried out on a panel of housekeeping genes to establish 
their suitability for use in the model system. 
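
As a rough illustration of such a preliminary experiment, the sketch below screens a panel of candidate reference genes for stability across conditions and then normalizes a target signal to a stable reference. The band intensities, the 10% coefficient-of-variation cut-off and the function names are hypothetical and chosen only for the example.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def pick_reference(panel, max_cv=0.10):
    """Return candidate genes with CV below max_cv, most stable first."""
    stable = [g for g, v in panel.items() if coefficient_of_variation(v) < max_cv]
    return sorted(stable, key=lambda g: coefficient_of_variation(panel[g]))


def normalized_fold_change(target_test, target_ctrl, ref_test, ref_ctrl):
    """Fold change of the target after dividing by the internal standard."""
    return (target_test / ref_test) / (target_ctrl / ref_ctrl)


# Hypothetical band intensities: control, low dose, high dose
panel = {
    "GAPDH": [1000.0, 1020.0, 980.0],
    "beta-actin": [1000.0, 1400.0, 1900.0],  # responds to treatment: unsuitable
    "HPRT": [500.0, 510.0, 505.0],
}
stable = pick_reference(panel)

# Raw target signal rises 4-fold, but the reference shows twice as much
# material in the test lane, so the normalized change is 2-fold.
fold = normalized_fold_change(400.0, 100.0, 2000.0, 1000.0)
print(stable, fold)
```

A gene that fails the stability screen in the model system would distort every ratio calculated against it, which is why the screen comes first.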

Interpretation of quantitative data must also be treated with caution. By 
comparing the lists of genes identified by differential expression one can perhaps 
gain insight into why two different species react in different ways to external stimuli. 
For example, rats and mice appear sensitive to the non-genotoxic effects of a wide 
range of peroxisome proliferators whilst Syrian hamsters and guinea pigs are largely 
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, 1993, 
Makowska et al. 1992). A simplified approach to resolving the reason(s) why is to 
compare lists of up- and down-regulated genes in order to identify those which are 
expressed in only one species and, through background knowledge of the effects of 
the said gene, might suggest a mechanism of facilitated non-genotoxic carcinogenesis 
or protection. Of course, the situation is likely to be far more complex. Perhaps if 
there were one key gene protecting guinea pig from non-genotoxic effects and it was 
upregulated 50 times by PPs, the same gene might only be up-regulated five times 
in the rat. However, since both were noted to be upregulated, the importance of the 
gene may be overlooked. Just to complicate matters, a large change in expression 
does not necessarily mean a biologically important change. For example, what is the 
true relevance of gene Y which shows a 50-fold increase after a particular treatment, 
and gene Z which shows only a 5-fold increase? If one examines the literature one 
may find that historically, gene Y has often been shown to be up-regulated 40-60- 
fold by a number of unrelated stimuli— in light of this the 50-fold increase would 
appear less significant. However, the literature may show that gene Z has never been 
recorded as having more than doubled in expression — which makes your 5-fold 
increase all the more exciting. Perhaps even more interesting is if that same 5-fold 
increase has only been seen in related neoplasms or following treatment with related 
chemicals. 
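
The gene Y and gene Z comparison can be made concrete by expressing each observed change relative to the largest change previously reported for that gene. The sketch below uses the figures from the example above (a historical maximum of 60-fold for Y and 2-fold for Z); the scoring function itself is an illustrative assumption, not an established metric.

```python
def relative_to_history(observed_fold, historical_max_fold):
    """Observed fold change as a multiple of the largest change on record."""
    return observed_fold / historical_max_fold


genes = {
    "Y": {"observed": 50.0, "historical_max": 60.0},
    "Z": {"observed": 5.0, "historical_max": 2.0},
}

scores = {
    name: relative_to_history(g["observed"], g["historical_max"])
    for name, g in genes.items()
}
# Rank genes by how unusual the observed change is, most unusual first
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

On this scale the 5-fold change in gene Z is 2.5 times larger than anything on record, whereas the 50-fold change in gene Y falls within its usual range.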

Problems in using the differential display approach 

Differential display technology originally held promise of an easily obtainable 
'fingerprint' of those genes which are up- or down-regulated in test animals/cells in 
a developmental process or following exposure to given stimuli. However, it has 
become clear that the fingerprinting process, whilst still valid, is much too complex 
to be represented by a single technique profile. This is because all differential display 
techniques have common and/or unique technical problems which preclude the 
isolation and identification of all those genes which show changes in expression. 
Furthermore, there are important genetic changes related to disease development 
which differential expression analysis is simply not designed to address. An example 
of this is the presence of small deletions, insertions, or point mutations such as those 
seen in activated oncogenes, tumour suppressor genes and individual poly- 
morphisms. Polymorphic variations, small though they usually are, are often 
regarded as being of paramount importance in explaining why some patients 
respond better than others to certain drug treatments (and, in logical extension, why 
some people are less affected by potentially dangerous xenobiotics/carcinogens than 
others). The identification of such point mutations and naturally occurring 
polymorphisms requires the subsequent application of sequencing, SSCP, DGGE 
or TGGE to the gene of interest. Furthermore, differential display is not designed 
to address issues such as alternatively spliced gene species or whether an increased 
abundance of mRNA is a result of increased transcription or increased mRNA 
stability. 

Conclusions 

Perhaps the main advantage of open system differential display techniques is that 
they are not limited by extant theories or researcher bias in revealing genes which are 
differentially expressed, since they are designed to amplify all genes which 
demonstrate altered expression. This means that they are useful for the isolation of 
previously unknown genes which may turn out to be useful biomarkers of a particular 
state or condition. At least one open system (SAGE) is also quantitative, thus 
eliminating the need to return to the original mRNA and carry out Northern/PCR 
analysis to confirm the result. However, the rapid progress of genome mapping 
projects means that over the next 5-10 years or so, the balance of experimental use 
will switch from open to closed differential display systems, particularly DNA 
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are 
suitable for high throughput analysis and can be tailored to look at specific signalling 
pathways or families of genes. Identification of all the gene sequences in human and 
common laboratory animals combined with improved DNA array technology, 
means that it will soon no longer be necessary to try to isolate differentially expressed 
genes using the technically more demanding open system approach. Thus, their 
main advantage (that of identifying unknown genes) will be largely eradicated. It is 
likely, therefore, that their sphere of application will be reduced to analysis of the 
less common laboratory species, since it will be some time yet before the genomes of 
such animals as zebrafish, electric eels, gerbils, crayfish and squid, for example, will 
be sequenced. 

Of course, in the end the question will always remain: What is the functional/ 
biological significance of the identified, differentially expressed genes? One 
carcinogenic effect. Whilst differential display technology cannot hope to answer 
these questions, it does provide a springboard from which identification, regulatory 
and functional studies can be launched. Understanding the molecular mechanism of 
cellular responses is almost impossible without knowing the regulation and function 
of those genes and their condition (e.g. mutated). In an abstract sense, differential 
display can be likened to a still photograph, showing details of a fixed moment in 
time. Consider the Historian who knows the outcome of a battle and the placement 
and condition of the troops before the battle commenced, but is asked to try and 
deduce how the battle progressed and why it ended as it did from a few still 
photographs, an impossible task. In order to understand the battle, the Historian 
must find out the capabilities and motivation of the soldiers and their commanding 
officers, what the orders were and whether they were obeyed. He must examine the 
terrain, the remains of the battle and consider the effects the prevailing weather 
conditions exerted. Likewise, if mechanistic answers are to be forthcoming, the 
scientist must use differential display in combination with other techniques, such as 
knockout technology, the analysis of cell signalling pathways, mutation analysis and 
time and dose response analyses. Although this review has emphasized the 
importance of differential gene profiling, it should not be considered in isolation and 
the full impact of this approach will be strengthened if used in combination with 
functional genomics and proteomics (2-dimensional protein gels from isoelectric 
focusing and subsequent SDS electrophoresis and virtual 2D-maps using capillary 
electrophoresis). Proteomics is attracting much recent attention as many of the 
changes resulting in differential gene expression do not involve changes in mRNA 
levels, as described extensively herein, but rather protein-protein, protein-DNA and 
protein phosphorylation events which would require functional genomics or 
proteomic technologies for investigation. 

Despite the limitations of differential display technology, it is clear that many 
potential applications and benefits can be obtained from characterizing the genetic 
changes that occur in a cell during normal and disease development and in response 
to chemical or biological insult. In light of functional data, such profiling will 
provide a 'fingerprint' of each stage of development or response, and in the long 
term should help in the elucidation of specific and sensitive biomarkers for different 
types of chemical/biological exposure and disease states. The potential medical and 
therapeutic benefits of understanding such molecular changes are almost 
immeasurable. Amongst other things, such fingerprints could indicate the family or 
even specific type of chemical an individual has been exposed to plus the length 
and/or acuteness of that exposure, thus indicating the most prudent treatment. 
They may also help uncover differences in histologically identical cancers, provide 
diagnostic tests for the earliest stages of neoplasia and, again, perhaps indicate the 
most efficacious treatment. 

The Human Genome Project will be completed early in the next century and the 
DNA sequence of all the human genes will be known. The continuing development 
and evolution of differential gene expression technology will ensure that this 
knowledge contributes fully to the understanding of human disease processes. 

ABSTRACT The recent ability to sequence whole genomes 
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The 
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 



The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, 
Haemophilus influenzae (1), Mycoplasma genitalium (2), and 
Methanococcus jannaschii (3) have been completely sequenced. 
Other model organisms have had substantial portions of their 
genomes sequenced as well, including the nematode 
Caenorhabditis elegans (4) and the small flowering plant Arabidopsis 
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in 
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For 
example, over one-half of new ORFs in S. cerevisiae have no 
known function (6). If this is the case in a well studied organism 
such as yeast, the problem will be even worse in organisms that 
are less well studied or less manipulable. A large, 
experimentally determined gene function database would make 
homology/motif searches much more useful. 

Experimental analysis must be performed to thoroughly 
understand the biological function of a gene product. Scaling 
up from classical 'cottage industry' one-gene-oriented 
approaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects: 
whole-genome experimental analysis to explore gene 
expression, gene product function, and other genome functions. 
Model organisms such as S. cerevisiae will be extremely 
important in the development of novel whole-genome analysis 
techniques and, subsequently, in improving our understanding 
of other more complex and less manipulable organisms. 

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other 
genome regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). 
Efforts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging 
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons — they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for 
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 
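
The stated throughput (three plates of 96 oligonucleotides per 8-hr day) allows a rough estimate of synthesis time. The sketch below assumes one instrument, two primers per ORF, full plates and no failed syntheses; the figures of 240 ORFs for the chromosome V pilot and about 6,000 ORFs for the yeast genome are those quoted elsewhere in this article.

```python
import math

WELLS_PER_PLATE = 96
PLATES_PER_DAY = 3  # per 8-hr day, as stated for the current instrument


def synthesis_days(n_orfs, primers_per_orf=2):
    """Return (primers, plates, working days) for a set of ORFs."""
    primers = n_orfs * primers_per_orf
    plates = math.ceil(primers / WELLS_PER_PLATE)
    days = math.ceil(plates / PLATES_PER_DAY)
    return primers, plates, days


pilot = synthesis_days(240)    # chromosome V pilot
genome = synthesis_days(6000)  # approximate yeast ORF count
print(pilot, genome)
```

Under these assumptions the pilot needs 5 plates (2 working days) and the whole genome needs 125 plates (42 working days) on a single instrument.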

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

If amplicons are used directly on arrays without cloning, it 
is important to note that, even if single PCR product bands are 
observed on gels, the PCR products will be contaminated with 
various amounts of other sequences. This contamination has 

Fig. 1. Overview of systematic method for isolating individual 
genes. Sequence information is obtained automatically from sequence 
databases. The data are input into primer selection software 
specifically designed to target ORFs as designated by database annotations. 
The output file containing the primer information is directly read by 
a high-throughput oligonucleotide synthesizer, which makes the 
oligonucleotides in 96-well plates (AMOS, automated multiplex 
oligonucleotide synthesizer). The forward and reverse primers are 
synthesized in the same location on separate plates to facilitate the 
downstream handling of primers. The amplicons are generated by PCR in 
96-well plates as well. 

analysis. On the other hand, direct use of the amplicons is 
much less labor intensive and greatly decreases the occurrence 
of mistakes in clone identification, a ubiquitous problem 
associated with large clone set archiving and retrieving. 

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthe- 
sized in separate plates in corresponding wells to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, con- 
tains a known, unique ORF. 
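
A minimal sketch of this bookkeeping is given below. It is not the primer selection software used in the study: it simply takes the first 25 bases of an ORF as the forward primer and the reverse complement of the last 25 bases as the reverse primer, ignoring melting temperature and specificity checks, and assigns each pair to the same well on two plates. The row-major well order, the function names and the toy ORF are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def naive_primers(orf, length=25):
    """Forward primer = ORF start; reverse primer = revcomp of ORF end."""
    orf = orf.upper()
    if len(orf) < 2 * length:
        raise ValueError("ORF too short for non-overlapping primers")
    return orf[:length], reverse_complement(orf[-length:])


def well_name(index):
    """Map 0..95 to A1..H12 in row-major order (assumed layout)."""
    if not 0 <= index < 96:
        raise ValueError("a 96-well plate holds indices 0..95")
    row, col = divmod(index, 12)
    return f"{'ABCDEFGH'[row]}{col + 1}"


def plate_layout(orfs):
    """Place forward and reverse primers in matching wells of two plates."""
    forward_plate, reverse_plate = {}, {}
    for i, (name, seq) in enumerate(orfs.items()):
        fwd, rev = naive_primers(seq)
        well = well_name(i)
        forward_plate[well] = (name, fwd)
        reverse_plate[well] = (name, rev)
    return forward_plate, reverse_plate


# Toy 60-base ORF (hypothetical name and sequence)
toy_orf = "ATG" + "GCT" * 18 + "TAA"
fwd_plate, rev_plate = plate_layout({"ORF_example": toy_orf})
print(fwd_plate["A1"], rev_plate["A1"])
```

Keeping both primers of a pair at the same plate coordinate means a pipetting robot can combine plates well-for-well without consulting a lookup table.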

Large-scale genome analysis projects are dependent on 
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of 
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Changes in attitude are also required. One of the major costs 
of commercial oligonucleotides is extensive quality control 
such that virtually 100% of the supplied oligonucleotides are 
successfully synthesized and work for their intended purpose. 

Fig. 2. Overall approach for using database of a genome to direct 
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each 
gene of S. cerevisiae can be used in many applications utilizing both 
cloning and microarraying technology. 

Considerable cost reduction can be obtained by simply 
decreasing the expected successful synthesis rate to 95-97%. One 
can then achieve faster and cheaper whole genome coverage by 
simply adding a single quality control at the end of the 
experiment and batching the failures for resynthesis. 
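
The effect of batching failures can be estimated with a simple expectation calculation. The sketch below assumes that each oligonucleotide succeeds independently with the stated probability, that an ORF is amplified only when both of its primers work, and that both primers of a failed ORF are resynthesized in the next round; these are simplifying assumptions, not statements about the actual procedure.

```python
def resynthesis_schedule(n_orfs, oligo_success=0.95):
    """Expected number of ORFs still missing after each synthesis round.

    Rounds continue until fewer than one ORF is expected to remain.
    """
    pair_success = oligo_success ** 2  # both primers must work
    remaining = float(n_orfs)
    schedule = []
    while remaining >= 1.0:
        remaining *= 1.0 - pair_success
        schedule.append(remaining)
    return schedule


schedule = resynthesis_schedule(6000, oligo_success=0.95)
for round_number, missing in enumerate(schedule, start=1):
    print(round_number, round(missing, 1))
```

With a 95% per-oligonucleotide success rate, roughly 585 of 6,000 ORFs would be expected to fail in the first round, and the expected number remaining falls below one after the fourth round.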

The directed nature of the amplicon approach is of clear 
advantage. The sequence of each ORF is analyzed automati- 
cally, and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved — for 
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10). 
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Current expression and genetic analysis technologies are 
geared toward the analysis of single genes and are ill suited to 
analyze numerous genes under many conditions. Additional 
difficulties with current technologies include: the effort and 
expense required to analyze expression and make mutants, the 
potential duplication of effort if done by different laboratories, 
and the possibility of conflicting results obtained from differ- 
ent laboratories. In contrast, whole genome analysis not only 
is more efficient, it also provides data of much higher quality; 
all genes are assayed and compared in parallel under exactly 

Fig. 3. Gel image of amplifications using the method described in 
Fig. 1. One plate of 96 amplification reactions is shown. 

the same conditions. In addition, amplicons have many appli- 
cations beyond gene expression. For example, one recent 
approach is to incorporate a unique DNA sequence tag, 
synthesized as part of each gene specific primer, during 
amplification. The tags or molecular bar codes, when reintro- 
duced into the organism as a gene deletion or as a gene clone, 
can be used much more efficiently than individual mutations 
or clones because pools of tagged mutants or transformants 
can be analyzed in parallel. This parallel analysis is possible 
because the tags are readily and quantitatively amplified even 
in complex mixtures of tags (13). 
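
The parallel analysis of a tagged pool can be sketched as a comparison of each tag's share of the total signal before and after a selection. The tag names, signal values and the small floor used to avoid division by zero are hypothetical; the readout could be any quantitative measure of tag abundance.

```python
import math


def normalize(signals):
    """Convert raw tag signals to fractions of the pool total."""
    total = sum(signals.values())
    return {tag: s / total for tag, s in signals.items()}


def log2_enrichment(before, after, floor=1e-6):
    """log2 change in each tag's share of the pool after selection."""
    b, a = normalize(before), normalize(after)
    return {
        tag: math.log2(max(a.get(tag, 0.0), floor) / max(b[tag], floor))
        for tag in b
    }


# Hypothetical tag signals for a pool of four tagged strains
before = {"tag_A": 100.0, "tag_B": 100.0, "tag_C": 100.0, "tag_D": 100.0}
after = {"tag_A": 200.0, "tag_B": 100.0, "tag_C": 50.0, "tag_D": 50.0}

scores = log2_enrichment(before, after)
depleted = sorted(tag for tag, s in scores.items() if s <= -1.0)
print(scores, depleted)
```

Every strain in the pool is scored from a single pair of measurements, which is the efficiency gain over testing mutants one at a time.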

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional 
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct 
read-out. These include screens and selections for mutant 
complementation, overexpression suppression (15, 16), second-site 
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19), 
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive 
understanding of gene function and, more broadly, of the entire genome, 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the 

The availability of genome-scale DNA sequence information and reagents has radically altered life-science 
research. This revolution has led to the development of a new scientific subdiscipline derived from a combina- 
tion of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the 
identification of potential human and environmental toxicants, and their putative mechanisms of action, through 
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of 
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene 
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for 
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray 
technology and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog. 24:153-159, 
1999. © 1999 Wiley-Liss, Inc. 

INTRODUCTION 

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4]. 
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel 
methods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, S1 
nuclease analysis, plaque hybridization, and slot blots 
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid 
support. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
[13,14]. Sample detection for microarrays on glass 
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from 
control and test RNA samples in single-round 
reverse-transcription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed 
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic 
versus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21]. 
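
The per-feature calculation described above (the ratio of fluor 1 to fluor 2, corrected for local background) can be sketched as follows. This is not the custom image analysis software cited in the text; the intensity values, the floor applied to background-subtracted signals and the function name are assumptions made for illustration.

```python
import math


def corrected_ratio(test_signal, test_bg, control_signal, control_bg,
                    min_signal=1.0):
    """Test/control ratio after subtracting each channel's local background.

    min_signal is an assumed floor that keeps weak spots from producing
    zero or negative values.
    """
    test = max(test_signal - test_bg, min_signal)
    control = max(control_signal - control_bg, min_signal)
    return test / control


# Hypothetical spot intensities:
# (test signal, test background, control signal, control background)
spots = {
    "gene_1": (2100.0, 100.0, 600.0, 100.0),
    "gene_2": (600.0, 100.0, 600.0, 100.0),
    "gene_3": (350.0, 100.0, 1100.0, 100.0),
}

log_ratios = {
    name: math.log2(corrected_ratio(*values)) for name, values in spots.items()
}
print(log_ratios)
```

Because both channels are read from the same spot, differences in the amount of DNA deposited cancel in the ratio; a log2 scale places induction and repression symmetrically around zero.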

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, 
albeit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the 
application of this method to the fields of medical 
diagnostics, pharmacogenetics, and sequencing by 
hybridization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 
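
The arithmetic behind the "4n cycles" statement can be sketched as follows. This is a minimal illustration, assuming one coupling cycle per base (A, C, G, T) at each of the n positions; the function names are hypothetical and do not come from the cited work. The second function gives the size of the sequence space, which is only an upper bound, since the text notes that the practical number of oligos is limited by mask complexity and chip size.

```python
def synthesis_cycles(oligo_length: int, bases: str = "ACGT") -> int:
    """Upper bound on coupling cycles: one cycle per base at each position."""
    return len(bases) * oligo_length


def sequence_space(oligo_length: int, bases: str = "ACGT") -> int:
    """Number of distinct oligos of this length that could in principle exist."""
    return len(bases) ** oligo_length


# Cycle count grows linearly with length; sequence space grows exponentially.
for n in (8, 20, 25):
    print(f"n={n}: at most {synthesis_cycles(n)} cycles, "
          f"{sequence_space(n):,} possible sequences")
```

The contrast between the linear cycle count and the exponential sequence space is what makes the in situ approach attractive for very large probe sets.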

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin)
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human im- 
munodeficiency virus- 1 clade B protease gene [40] 
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast strain S. cerevisiae [12]. 
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam-
age. Examples of these assays include the Ames test,
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe-
sis, and many others. Fundamental to all of these
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres-
sion changes are assessed by hybridization to a cDNA
microarray chip (Figure 1). We have developed a cus-
tom DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all ex-
periments the gene-expression data based on rela-
tive fold induction or suppression of genes in treated
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways
induced by a single agent (e.g., reveal that a com-
pound has both PAH-like and oxidant-like proper-
ties). In the future, it may be necessary to distinguish
very subtle differences between compounds within
a very large sample set (e.g., thousands of highly simi-
lar structural isomers in a combinatorial chemistry
library or peptide library). To generate these highly
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
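
The signature method described above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the authors' implementation: it uses a fixed two-fold threshold (|log2 ratio| >= 1) in place of the ranking step, treats "consistent" as "same direction in every experiment", and all function names, gene labels, and values are hypothetical.

```python
import math


def log2_ratios(treated, control):
    """Per-gene log2(treated/control); positive values indicate induction."""
    return {g: math.log2(treated[g] / control[g]) for g in treated}


def derive_signature(experiments, min_abs_log2=1.0):
    """Keep genes changed in the same direction, beyond the threshold,
    in every experiment of one toxicant class (+1 induced, -1 suppressed)."""
    genes = set.intersection(*(set(e) for e in experiments))
    signature = {}
    for g in genes:
        values = [e[g] for e in experiments]
        if all(v >= min_abs_log2 for v in values):
            signature[g] = 1
        elif all(v <= -min_abs_log2 for v in values):
            signature[g] = -1
    return signature


def match_score(profile, signature, min_abs_log2=1.0):
    """Fraction of signature genes that the unknown profile changes
    in the signature's direction and beyond the threshold."""
    if not signature:
        return 0.0
    hits = sum(1 for g, direction in signature.items()
               if direction * profile.get(g, 0.0) >= min_abs_log2)
    return hits / len(signature)


# Hypothetical log2 ratios from two experiments with known oxidant stressors.
oxidant_runs = [
    {"geneA": 2.1, "geneB": 0.1, "geneC": -1.5},
    {"geneA": 1.4, "geneB": -0.3, "geneC": -2.0},
]
oxidant_signature = derive_signature(oxidant_runs)

unknown = {"geneA": 1.8, "geneB": 0.0, "geneC": -1.2}
unrelated = {"geneA": 0.2, "geneB": 2.5, "geneC": 0.1}
print(match_score(unknown, oxidant_signature))
print(match_score(unrelated, oxidant_signature))
```

A simple overlap score like this is enough to separate broad toxicant classes; for the subtle distinctions mentioned in the text, clustering or principal-component analysis over the full profiles would replace the threshold rule.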

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0.
Figure 2. Schematic representation of the method for iden-
tification of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model sys- 
tems to known toxicants are analyzed, and a set of changes 
characteristic to that type of toxicant (termed the toxicant 
signature) is identified. As depicted, oxidant stressors produce 
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray 
circles). The set of gene-expression changes elicited by the 
suspected toxicant is then compared with these characteristic 
patterns, and a putative mechanism of action is assigned to 
the unknown agent. 



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip v1.0.
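
The housekeeping-gene normalization mentioned above can be illustrated with a short sketch. It assumes the simplest reading of the text (divide every signal by the mean housekeeping intensity on the same chip); the function name, gene labels, and intensity values are hypothetical examples, not data from the ToxChip.

```python
from statistics import mean


def normalize_to_housekeeping(intensities, housekeeping_genes):
    """Scale each gene's signal by the mean housekeeping signal on the chip."""
    reference = mean(intensities[g] for g in housekeeping_genes)
    if reference <= 0:
        raise ValueError("mean housekeeping intensity must be positive")
    return {gene: value / reference for gene, value in intensities.items()}


# Hypothetical raw intensities from one hybridization.
raw = {"ACTB": 1000.0, "GAPDH": 3000.0, "CYP1A1": 500.0}
normalized = normalize_to_housekeeping(raw, ["ACTB", "GAPDH"])
print(normalized["CYP1A1"])
```

Normalizing control and treated chips this way before taking ratios helps keep differences in labeling efficiency or hybridization from appearing as expression changes, provided the housekeeping genes themselves are unaffected by the toxicant.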

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual in-
terpretation of data. We are also developing
ToxChip v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown 
Table 1. ToxChip v1.0: A Human cDNA Microarray Chip Designed to Detect Responses to Toxic Insult

Gene category                            No. of genes on chip
Apoptosis                                72
DNA replication and repair               99
Oxidative stress/redox homeostasis       90
Dioxin/PAH responsive                    12
Estrogen responsive                      63
Housekeeping                             84
Oncogenes and tumor suppressor genes     76
Cell-cycle control                       51
Transcription factors                    131
Kinases                                  276
Phosphatases                             88
Heat-shock proteins                      [illegible]
Receptors                                349
Cytochrome P450s                         30

* This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.



agent is (or is not) responsible for eliciting a given 
biological response. This information would help to 
select a bioassay more specifically suited to the agent 
in question or perhaps suggest that a bioassay is not 
necessary, which would dramatically reduce cost, 
animal use, and time. 

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further-
more, microarrays might be particularly useful for
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 



also be improved by the addition of microarray analy-
sis. The combination of microarrays with traditional
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental-
monitoring capacity to measure the effect of poten-
tial contaminants on the gene-expression profiles
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres-
sion endpoints in subjects in clinical trials. The com-
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con-
sideration at the NIEHS. An important consideration
for these types of studies is that gene expression can
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene-
expression database are currently under way [44,45].
Alleles, Oligo Arrays, and Toxicogenetics

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility.

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entry into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray
hybridization will have a dramatic impact on toxicol-
ogy research. In the future, the information gathered
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health. 
Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints,
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers.
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Proteomics; Genomics; Toxicology



1. Introduction 

The majority of drugs act by binding to protein
targets, most to known proteins representing en-
zymes, receptors and channels, resulting in effects
such as enzyme inhibition and impairment of
signal transduction. The treatment-induced per-
turbations provoke feedback reactions aiming to
compensate for the stimulus, which almost always
are associated with signals to the nucleus, result-
ing in altered gene expression. Such gene expres-
sion regulations account for both the
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or
global protein expression profiling. Hence, for
each individual drug, a characteristic gene regula-
tion pattern, its molecular fingerprint, exists
which bears valuable information on its mode of
action and its mechanism of toxicity.

Gene expression is a multistep process that
results in an active protein (Fig. 1). There exist
numerous regulation systems that exert control at
and after the transcription and the translation
step. Genomics, by definition, encompasses the
quantitative analysis of transcripts at the mRNA
level, while the aim of proteomics is to quantify
gene expression further down-stream, creating a
snapshot of gene regulation closer to ultimate cell


S. Steiner, N.L. Anderson / Toxicology

2. Global mRNA profiling 

Expression data at the mRNA level can be
produced using a set of different technologies
such as DNA microarrays, reverse transcript
imaging, amplified fragment length polymorphism
(AFLP), serial analysis of gene expression
(SAGE) and others. Currently, DNA microarrays
are very popular and promise a great potential.
On a typical array, each gene of interest is repre-
sented either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reac-
tion (PCR) and spotted on a suitable substrate
using robotics (Schena et al., 1995; Shalon et al.,
1996) or by several short oligonucleotides (20-30
bp) synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated
tissues, total RNA or mRNA is isolated and
reverse transcribed in the presence of radioactive
or fluorescent labeled nucleotides, and the labeled
probes are then hybridized to the arrays. The
intensity of the array signal is measured for each
gene transcript by either autoradiography or laser
scanning confocal microscopy. The ratio between
the signals of control and treated samples reflects
the relative drug-induced change in transcript
abundance.
3. Global protein profiling

Global quantitative expression analysis at the
protein level is currently restricted to the use of
two-dimensional gel electrophoresis. This tech-
nique combines separation of tissue proteins by
isoelectric focusing in the first dimension and by
sodium dodecyl sulfate slab gel electrophoresis-
based molecular weight separation in the second,
orthogonal dimension (Anderson et al., 1991).
The product is a rectangular pattern of protein
spots that are typically revealed by Coomassie
Blue, silver or fluorescent staining (Fig. 2).
Protein spots are identified by mass spectrometry
following generation of peptide mass fingerprints
(Mann et al., 1993) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the
ratios between the optical densities of spots from
control and treated samples are compared to
search for treatment-related changes.



4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 
Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages.
quantitative expression data has been collected, is
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxi-
city and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug
effect database is growing, one may detect similar-
ities and differences between the molecular finger-
prints produced by various drugs, information
that may be crucial to make a decision whether to
refocus or extend the therapeutic spectrum of a


5. Comparison of global mRNA and protein
expression profiling

There are several synergies and overlaps of data
obtained by mRNA and protein expression analy-
sis. Low abundant transcripts may not be easily
quantified at the protein level using standard two-
dimensional gel electrophoresis analysis and their
detection may require prefractionation of sam-
ples. The expression of such genes may be prefer-
ably quantified at the mRNA level using
techniques allowing PCR-mediated target amplifi-
cation. Tissue biopsy samples typically yield good
quality of both mRNA and proteins; however, the
quality of mRNA isolated from body fluids is
often poor due to the faster degradation of
mRNA when compared with proteins. RNA sam-
ples from body fluids such as serum or urine are
often not very meaningful, and secreted proteins
are likely more reliable surrogate markers for
treatment efficacy and safety. Detection of post-
translational modifications, events often related to
function or nonfunction of a protein, is restricted
to protein expression analysis and rarely can be
predicted by mRNA profiling. Information on
subcellular localization and translocation of
proteins has to be acquired at the level of the
protein in combination with sample prefractiona-
tion procedures. The growing evidence of a poor
correlation between mRNA and protein abun-
dance (Anderson and Seilhamer, 1997) further
suggests that the two approaches, mRNA and
protein profiling, are complementary and should
be applied in parallel.



6. Expression profiling and drug development 

Understanding the mechanisms of action and
toxicity, and being able to monitor treatment
efficacy and safety during trials is crucial for the
successful development of a drug. Mechanistic
insights are essential for the interpretation of drug
effects and enhance the chances of recognizing
potential species specificities contributing to an
improved risk profile in humans (Richardson et
al., 1995; Steiner et al., 1996b; Aicher et al., 1998).
The value of expression profiling further increases
when links between treatment-induced expression
profiles and specific pharmacological and toxic
endpoints are established (Anderson et al., 1991,
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifesta-
tion of morphological alterations, giving expres-
sion profiling a great potential for early
compound screening, enabling one to select drug
candidates with wide therapeutic windows
reflected by molecular fingerprints indicative of
high pharmacological potency and low toxicity
(Arce et al., 1998). In later phases of drug devel-
Perspectives

The basic methodology of safety evaluation has
changed little during the past decades. Toxicity in
laboratory animals has been evaluated primarily
by using hematological, clinical chemistry and
histological parameters as indicators of organ
damage. The rapid progress in genomics and pro-
teomics technologies creates a unique opportunity
to dramatically improve the predictive power of
safety assessment and to accelerate the drug devel-
opment process. Application of gene and protein
expression profiling promises to improve lead se-
lection, resulting in the development of drug can-
didates with higher efficacy and lower toxicity.
The identification of biologically relevant surro-
gate markers correlated with treatment efficacy
and safety bears a great potential to optimize the
monitoring of pre-clinical and clinical trials.
Workshop Summary



Application of DNA Arrays to Toxicology 



John C. Rockett and David J. Dix 

Reproductive Toxicology Division, National Health and Environmental Effects Research Laboratory, U.S. Environmental Protection
Agency, Research Triangle Park, North Carolina, USA



DNA array technology makes it possible to rapidly genotype individuals or quantify the expression
of thousands of genes on a single filter or glass slide, and holds enormous potential in toxicologic
applications. This potential led to a U.S. Environmental Protection Agency-sponsored workshop
titled "Application of Microarrays to Toxicology" on 7-8 January 1999 in Research Triangle Park,
North Carolina. In addition to providing state-of-the-art information on the application of DNA or
gene microarrays, the workshop catalyzed the formation of several collaborations, committees, and
user's groups throughout the Research Triangle Park area and beyond. Potential applications of
microarrays to toxicologic research and risk assessment include genome-wide expression analyses to
identify gene-expression networks and toxicant-specific signatures that can be used to define mode
of action, for exposure assessment, and for environmental monitoring. Arrays may also prove useful
for monitoring genetic variability and its relationship to toxicant susceptibility in human popula-
tions. Key words: DNA arrays, gene arrays, microarrays, toxicology. Environ Health Perspect
107:681-685 (1999). [Online 6 July 1999]



Decoding the genetic blueprint is a dream that
offers manifold returns in terms of understand-
ing how organisms develop and function in an
often hostile environment. With the rapid
advances in molecular biology over the last 30
years, the dream has come a step closer to reali-
ty. Molecular biologists now have the ability to
elucidate the composition of any genome.
Indeed, almost 20 genomes have already been
sequenced and more than 60 are currently
under way. Foremost among these is the
Human Genome Mapping Project. However,
the genomes of a number of commonly used
laboratory species are also under intensive
investigation, including yeast, Arabidopsis,
maize, rice, zebra fish, mouse, rat, and dog. It
is widely expected that the completion of such
programs will facilitate the development of
many powerful new techniques and approach-
es to diagnosing and treating genetically and
environmentally induced diseases which afflict
mankind. However, the vast amount of data
being generated by genome mapping will
require new high-throughput technologies to
investigate the function of the millions of new
genes that are being reported. Among the most
widely heralded of the new functional
genomics technologies are DNA arrays, which
represent perhaps the most anticipated new
molecular biology technique since polymerase
chain reaction (PCR).

Arrays enable the study of literally thou-



has driven venture capitalists into a frenzy of
investment and many new companies are
springing up to claim a share of this rapidly
developing market.

The U.S. Environmental Protection
Agency (EPA) is interested in applying DNA
array technology to ongoing toxicologic stud-
ies. To learn more about the current state of
the technology, the Reproductive Toxicology
Division (RTD) of the National Health and
Environmental Effects Research Laboratory
(NHEERL; Research Triangle Park, NC)
hosted a workshop on "Application of
Microarrays to Toxicology" on 7-8 January
1999 in Research Triangle Park, North
Carolina. The workshop was organized by
David Dix, Robert Kavlock, and John Rockett
of the RTD/NHEERL. Twenty-two intra-
mural and extramural scientists from govern-
ment, academia, and industry shared informa-
tion, data, and opinions on the current and
future applications for this exciting new tech-
nology. The workshop had more than 1^0
attendees, including researchers, students, and
administrators from the EPA, the National
Institute of Environmental Health Sciences
(NIEHS), and a number of other establish-
ments from Research Triangle Park and
beyond. Presentations ranged from the tech-
nology behind array production through the
sharing of actual experimental data and projec-
tions on the future importance and applica-



a regular pattern to some kind of supportive
medium. DNA array is often used inter-
changeably with gene array or microarray.
Although not formally defined, microarray is
generally used to describe the higher density
arrays typically printed on glass chips. The
DNA elements that make up DNA arrays
can be oligonucleotides, partial gene
sequences, or full-length cDNAs. Companies
offering pre-made arrays that contain less
than full-length clones normally use regions
of the genes which are specific to that gene to
prevent false positives arising through cross-
hybridization. Sequence verification of
cDNA clone identity is necessary because of
errors in identifying specific clones from
cDNA libraries and databases. Premade
DNA arrays printed on membranes are cur-
rently or imminently available for human,
mouse, and rat. In most cases they contain
DNA sequences representing several thou-
sand different sequence clusters or genes as
delineated through the National Center for
Biotechnology Information UniGene Project
[JTi. Many of these different UniGene clusters
(putative genes) are represented only by
expressed sequence tags (ESTs).

Array Printing 

Arrays are typically printed on one of two
types of support matrix. Nylon membranes
are used by most off-the-shelf array providers
such as Clontech Laboratories, Inc.
(Palo Alto, CA), Genome Systems, Inc. (St.
Louis, MO), and Research Genetics, Inc.
(Huntsville, AL). Microarrays such as those
produced by Affymetrix, Inc. (Santa Clara,
CA), Incyte Pharmaceuticals, Inc. (Palo Alto,
CA), and many do-it-yourself (DIY) arraying
groups use glass wafers or slides. Although
standard microscope slides may be used, they
must be preprepared to facilitate sticking
of the DNA to the glass. Several different

Address correspondence to J. Rockett, Reproductive
Toxicology Division (MD-72), National Health
and Environmental Effects Research Laboratory,
U.S. EPA, Research Triangle Park, NC 27711
USA. Telephone: (919) 541-2678. Fax: (919) 541-
40 !~ E-mail: rockett.john@epa.gov



of the technology ... Despite this huge
surge of interest, DNA arrays are still little used
and largely unproven, as demonstrated by the
high ratio of review and press articles to actual



Array Elements 

In the context of molecular biology, the word
"array" is normally used to refer to a series of

DNA or protein elements firmly attached in
coatings have been successfully used, including
silane and lysine. The coating of slides
can easily be carried out in the laboratory,
but many prefer the convenience of precoated
slides available from suppliers.

Once the support matrix has been prepared,
the DNA elements can be applied by
several methods. Affymetrix, Inc., has developed
a unique photolithographic technology
for attaching oligonucleotides to glass wafers.
More commonly, DNA is applied by either
noncontact or contact printing. Noncontact
printers can use thermal, solenoid, or piezoelectric
technology to spray aliquots of solution
onto the support matrix and may be used to
produce slide or membrane-based arrays.
Cartesian Technologies, Inc. (Irvine, CA) has
developed nQUAD technology for use in its
PixSys printers. The system couples a syringe
pump with the microsolenoid valve, a combination
that provides rapid quantitative dispensing
of nanoliter volumes (down to 4.2 nL) over
a variable volume range. A different approach
to noncontact printing uses a solid pin and ring
combination (Genetic MicroSystems, Inc.,
Woburn, MA). This system (Figure 1) allows a
broader range of sample, including cell suspensions
and particulates, because the printing
head cannot be blocked up in the same way as
a spray nozzle. Fluid transfer is controlled in
this system primarily by the pin dimensions
and the force of deposition, although the
nature of the support matrix and the sample
will also affect transfer to some degree.

In contact printing, the pin head is dipped
in the sample and then touched to the support
matrix to deposit a small aliquot. Split pins
were one of the first contact-printing devices
to be reported and are the suggested format
for DIY arrayers, as described by Brown (5).
Split pins are small metal pins with a precise
groove cut vertically in the middle of the pin
tip. In this system, I — *S split pins are positioned
in the pin-head. The split pins work by
simple capillary action, not unlike a fountain
pen: when the pin heads are dipped in the
sample, liquid is drawn into the pin groove. A
small (fixed) volume is then deposited each
time the split pins are gently touched to
the support matrix. Sample (100-500 pL
depending on a variety of parameters) can be
deposited on multiple slides before refilling is
required, and array densities of > 2,500
spots/cm² may be produced. The deposit volume
depends on the split size, sample fluidity,
and the speed of printing. Split pins are
relatively simple to produce and can be made
in-house if a suitable machine shop is available.
Alternatively, they can be obtained
directly from companies such as TeleChem
International, Inc. (Sunnyvale, CA).

Irrespective of their source, printers 
should be run through a preprint sequence 
prior to producing the actual experimental 



arrays; the first 100 or so spots of a new run
tend to be somewhat variable. Factors affecting
spot reproducibility include slide treatment
homogeneity, sample differences, and
instrument errors. Other factors that come
into play include clean ejection of the drop
and clogging (nQUAD printing) and
mechanical variations and long-term alteration
in print-head surface of solid and split
pins. However, with careful preparation it is
possible to get a coefficient of variance for
spot reproducibility below 10%.
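
As a rough illustration of the reproducibility figure quoted above, the coefficient of variation for a set of replicate spots can be computed as the sample standard deviation divided by the mean. The sketch below is only an assumption about how such a check might be scripted; the function name and the intensity values are hypothetical and do not come from the workshop.

```python
from statistics import mean, stdev


def coefficient_of_variation(intensities):
    """Return the sample CV (standard deviation / mean) as a percentage."""
    if len(intensities) < 2:
        raise ValueError("need at least two replicate spots")
    m = mean(intensities)
    if m == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return 100.0 * stdev(intensities) / m


# Hypothetical signal intensities for five replicate spots of one sample.
replicate_spots = [980.0, 1020.0, 1000.0, 1045.0, 955.0]
cv = coefficient_of_variation(replicate_spots)
print(f"CV = {cv:.1f}% (a printer run might be accepted if this is below 10%)")
```

Discarding the first spots of a run before computing the figure would follow the preprint advice given in the text.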

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples
is normally effective at reducing sample carryover
to negligible amounts. Printing should
also be carried out in a controlled environment.
Humidified chambers are available in
which to place printers. These help prevent
dust contamination and produce a uniform
drying rate, which is important in determining
spot size, quality, and reproducibility.

In summary, although several printing
technologies are available, none are particularly
outstanding and the bottom line
is that they are still in a relatively early stage
of evolution.

Array Hybridization 

The hybridization protocol is, practically
speaking, relatively straightforward and those
with previous experience in blotting should
have little difficulty. Array hybridizations
are, in essence, reverse Southern/Northern
blots: instead of applying a labeled probe to
the target population of DNA/RNA, the
labeled population is applied to the probe(s).
With membrane-based arrays, the control and
treated mRNA populations are normally converted
to cDNA and labeled with isotope (e.g.,
• P) in the process. These labeled populations
are then hybridized independently to parallel
or serial arrays and the hybridization signal is
detected with a phosphorimager. A less commonly
used alternative to radioactive probes is
enzymatic detection. The probe may be
biotinylated, haptenylated, or have alkaline
phosphatase/horseradish peroxidase attached.
Hybridization is detected by enzymatic reaction
yielding a color reaction (4). Differences
in hybridization signals can be detected by eye
or, more accurately, with the help of digital
imaging and commercially available software.
The labeling of the test populations for slide-based
microarrays uses a slightly different
approach. The probe typically consists of two
samples of polyA+ RNA (usually from a treated
and a control population) that are converted to
cDNA; in the process each is labeled with a
different fluor. The independently labeled
probes are then mixed together and hybridized
to a single microarray slide and the resulting
combined fluorescent signal is scanned. After




Figure 1. Genetic MicroSystems (Woburn, MA) pin
ring system for printing arrays. The pin ring combination
consists of a circular open ring oriented
parallel to the sample solution with a vertical pin
centered over the ring. When the ring is dipped
into a solution and lifted, it withdraws an aliquot
of sample held by surface tension. To spot the
sample the pin is driven down through the ring
and a portion of the solution is transferred to the
bottom of the pin. The pin continues to move
downward until the pendant drop of solution
makes contact with the underlying surface. The
pin is then lifted, and gravity and surface tension
cause deposition of the spot onto the array.
Figure from Flowers et al. (14), with permission
from Genetic MicroSystems.

normalization, it is possible to determine the
ratio of fluorescent signals from a single
hybridization of a slide-based microarray.
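
The ratio calculation described above can be sketched in a few lines. The text does not say which normalization was used, so the example below assumes a simple global median centering of log2 ratios (that is, it assumes most spotted genes are unchanged between treated and control samples). The function name and intensity values are hypothetical.

```python
import math
from statistics import median


def normalized_log_ratios(treated, control):
    """Return per-spot log2(treated/control), centered on the median ratio.

    Assumption: most genes do not change, so the median log ratio
    reflects dye and labeling bias rather than biology.
    """
    raw = [math.log2(t / c) for t, c in zip(treated, control)]
    shift = median(raw)
    return [r - shift for r in raw]


# Hypothetical background-subtracted intensities for five spots.
# The treated channel is uniformly twice as bright (a labeling bias),
# and only the last spot shows a real difference.
treated = [200.0, 400.0, 800.0, 100.0, 3200.0]
control = [100.0, 200.0, 400.0, 50.0, 200.0]
ratios = normalized_log_ratios(treated, control)
print(ratios)
```

After centering, the four unchanged spots sit at a log2 ratio of zero and the last spot stands out at an 8-fold difference.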

cDNA derived from control and treated
populations of RNA is most commonly
hybridized to arrays, although subtractive
hybridization or differential display reactions
may also be used. Fluorophore- or radiolabeled
nucleotides are directly incorporated
into the cDNA in the process of converting
RNA to cDNA. Alternatively, 5' end-labeled
primers may be used for cDNA synthesis.
These are labeled with a fluorophore for
direct visualization of the hybridized array.
Alternatively, biotin or a hapten may be
attached to the primer, in which case fluor-labeled
streptavidin or antibody must be
applied before a signal can be generated. The
most commonly used fluorophores at present
are cyanine (Cy)3 and Cy5 (Amersham
Pharmacia Biotech AB, Uppsala, Sweden).
However, the relative expense of these fluorescent
conjugates has driven a search for
cheaper alternatives. Fluorescein, rhodamine,
and Texas red have all been used, and
companies such as Molecular Probes, Inc.
(Eugene, OR) are developing a series of
labeled nucleotides with a wide range of excitation
and emission spectra which may prove
to function as well as the Cy dyes.

Analysis of DNA Microarrays

Membrane-based arrays are normally analyzed
on film or with a phosphorimager, whereas
chip-based arrays require more specialized scanning
devices. These can be divided into three
main groups: the charge-coupled device camera
systems, the nonconfocal laser scanners, and the
confocal laser scanners. The advantages and disadvantages
of each system are listed in Table 1.

Because a typical spot on a microarray can
contain > 10^8 molecules, it is clear that a large
variation in signal strength may occur.
Current scanners cannot work across this
many orders of magnitude (4 or 5 is more typical).
However, the scanning parameters can
normally be adjusted to collect more or less
signal, such that two or three scans of the same
array should permit the detection of rare and
abundant genes.
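
One way to combine two scans of the same array into a single extended-range data set is sketched below. This is an illustrative assumption, not a procedure described at the workshop: it takes the high-gain reading for each spot unless that reading is saturated, in which case the low-gain reading is rescaled. The saturation ceiling, the gain ratio, and all values are hypothetical.

```python
SATURATION = 65535  # assumed ceiling of a 16-bit detector


def merge_scans(low_gain, high_gain, gain_ratio):
    """Combine two scans of one array into one intensity per spot.

    The high-gain value is kept for dim spots; for saturated spots the
    low-gain value is scaled up by gain_ratio (high gain / low gain).
    """
    merged = []
    for lo, hi in zip(low_gain, high_gain):
        if hi >= SATURATION:
            merged.append(float(lo * gain_ratio))
        else:
            merged.append(float(hi))
    return merged


# Hypothetical readings for a rare, a moderate, and an abundant message.
low = [12, 300, 9000]
high = [120, 3000, 65535]
merged = merge_scans(low, high, gain_ratio=10)
print(merged)
```

In practice the gain ratio would itself have to be estimated, for example from spots that are unsaturated in both scans.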

When a microarray is scanned, the fluorescent
images are captured by software normally
included with the scanner. Several commercial
suppliers provide additional software for quantifying
array images, but the software tools are
constantly evolving to meet the developing
needs of researchers, and it is prudent to
define one's own needs and clarify the exact
capabilities of the software before its purchase.
Issues that should be considered include the
following:

• Can the software locate offset spots?

• Can it quantitate across irregular hybridization
signals?

• Can the arrayed genes be programmed in for
easy identification and location?

• Can the software connect via the Internet to
databases containing further information on
the gene(s) of interest?

One of the key issues raised at the workshop
was the sensitivity of microarray technology.
Experiments by General Scanning, Inc.
(Watertown, MA), have shown that by using
the Cy dyes and their scanner, signal can be
detected down to levels of < 1 fluor molecule
per square micrometer, which translates to
detecting a rare message at approximately one
copy per cell or less.

Array Applications 

Although arrays are an emerging technology
certain to undergo improvement and
alteration, they have already been applied usefully
to a number of model systems. Arrays are
at their most powerful when they contain the

: . u j T.nis reason, t.nr* rsavc scorn 


both of these species have been sequenced and,
in the case of yeast, deposited onto arrays for
examination of gene expression (6,7). With
both of these species, it is relatively easy to
:-er* n j. 1 . erne expression Indeed / .... 



Table 1. Advantages and disadvantages of different microarray scanning systems.

               CCD camera system          Nonconfocal laser scanner   Confocal laser scanner

Advantages     Few moving parts           Relatively simple optics    Small depth of focus reduces
                                                                      artifacts
               Fast scanning of bright                                May have high light collection
               samples                                                efficiency

Disadvantages  Less appropriate for dim   Low light collection        Small depth of focus requires
               samples                    efficiency                  scanning precision
               Optical scatter can limit  Background artifacts not
               performance                rejected
                                          Resolution typically low

CCD, charge-coupled device.
From Kawasaki (13).

elegans knockouts can be made simply by
soaking the worms in an antisense solution of
the gene to be knocked out.

By a process of systematic gene disruption,
it is now possible to examine the cause
and effect relationships between different
genes in these simple organisms. This kind of
approach should help elucidate biochemical
pathways and genetic control processes,
deconvolute polygenic interactions, and
define the architecture of the cellular network.
A simple case study of how this can be
achieved was presented by Butow [University
of Texas Southwestern Medical Center,
Dallas, TX (Figure 2)]. Although it is the
phenotypic result of a single gene knockout
that is being examined, the effect of such
perturbation will almost always be polygenic.
Polygenic interactions will become increasingly
important as researchers begin to move
away from single gene systems when examining
the nature of toxicologic responses to
external stimuli. This is especially important
in toxicology because the phenotype produced
by a given environmental insult is
never the result of the action of a single gene;
rather, it is a complex interaction of one or
multiple cellular pathways. Phenomena such
as quantitative trait (the continuous variation
of phenotype), epistasis (the effect of alleles of
one or more genes on the expression of other
genes), and penetrance (proportion of individuals
of a given genotype that display a particular
phenotype) will become increasingly
evident and important as toxicologists push
toward the ultimate goal of matching the
responses of individuals to different
environmental stimuli.

Analysis of the transcriptome (the expression
level of all the genes in a given cell population)
was a use of arrays addressed by several
speakers. Unfortunately, current gene nomen-



transcriptomes for human, rat, and mouse. In a
slightly different approach, Nuwaysir et al. (8)
describes how the NIEHS assembled what is
effectively a "toxicological transcriptome": a
library of human and mouse genes that have
previously been proven or implicated in
responses to toxicologic insults. Clontech
Laboratories, Inc. (Palo Alto, CA), has begun a
similar process by developing stress/toxicology
filter arrays of rat, mouse, and human genes.
Thus, rather than being tissue or cell specific,
these stress/toxicology arrays can be used across
a variety of model systems to look for alterations
in the expression of toxicologically
important genes and define the new field of
toxicogenomics. The potential to identify toxicant
families based on tissue- or cell-specific
gene expression could revolutionize drug testing.
These molecular signatures or fingerprints
could not only point to the possible
toxicity/carcinogenicity of newly discovered
compounds (Figure 3), but also aid in elucidating
their mechanism of action through identification
of gene expression networks. By extension,
such signatures could provide easily identifiable
biomarkers to assess the degree, time,
and nature of exposure.
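
The idea of matching a new compound against known toxicant signatures can be illustrated with a small sketch. The text does not specify a matching method, so the example assumes Pearson correlation between log-ratio expression profiles and an arbitrary cutoff of 0.9. All profile values, the third family label, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_match(profile, signatures, threshold=0.9):
    """Return (family, r) for the best-correlated signature.

    The family is None when even the best correlation is below the
    (assumed) threshold, i.e. the compound matches no known family.
    """
    family, r = max(
        ((name, pearson(profile, sig)) for name, sig in signatures.items()),
        key=lambda pair: pair[1],
    )
    return (family if r >= threshold else None, r)


# Hypothetical log2 expression ratios for five genes per toxicant family.
signatures = {
    "peroxisome proliferators": [2.0, 1.5, -1.0, 0.0, -2.0],
    "polycyclic aromatic hydrocarbons": [-1.0, 0.0, 2.0, 1.5, 0.5],
    "hypothetical family C": [0.0, -2.0, 0.5, 2.0, -0.5],
}
compound_1 = [1.9, 1.6, -0.9, 0.1, -2.1]
compound_2 = [0.5, -0.5, -0.5, 0.5, 0.0]

match_1 = best_match(compound_1, signatures)
match_2 = best_match(compound_2, signatures)
print(match_1[0], match_2[0])
```

Here the first compound is assigned to the peroxisome proliferator family and the second matches none, which is the situation pictured in Figure 3.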

DNA arrays are primarily a tool for examining
differential gene expression in a given
model. In this context they are referred to as
closed systems because they lack the ability of
other differential expression technologies, e.g.,
differential display and subtractive hybridization,
to detect previously unknown genes not
present on the array. This would appear to
limit the power of DNA arrays to the imaginations
and preconceptions of the researcher in
selecting genes previously characterized and
thought to be involved in the model system.
However, the various genome sequencing projects
have created a new category of
sequence, the EST, that has partially molli-



gene nomenclature. Nevertheless, once a transcriptome
has been assembled it can then be
transferred onto arrays and used to screen any

chosen system. The EPA MicroArray
.•".sort:-- : n \V...V~ . asvemr..."L- 



'■ra,-^'''^' - >v» nm o-rn assigned 

specific genetic identity. By incorporating EST
clones into an array, it is possible to monitor
the expression of these unknown genes. This

can enable the identification of previously

...;;a..r«i.. :r:; r ::\r nave biologic 



significance in the model system. Filter arrays
from Research Genetics and slide arrays from
Incyte Pharmaceuticals both incorporate large
numbers of ESTs from a variety of species.

A further use of microarrays is the identification
of single nucleotide polymorphisms
(SNPs). These genomic variations are abundant
(they occur approximately every 1 kb or
so) and are the basis of restriction fragment
length polymorphism analysis used in forensic
analysis. Affymetrix, Inc., designed chips that
contain multiple repeats of the same gene
sequence. Each position is present with all four
possible bases. After the hybridization of the
sample, the degree of hybridization to the different
sequences can be measured and the exact
sequence of the target gene deduced. SNPs are
thought to be of vital importance in drug
metabolism and toxicology. For example, single
base differences in the regulatory region or
active site of some genes can account for huge
differences in the activity of that gene. Such
SNPs are thought to explain why some people
are able to metabolize certain xenobiotics better
than others. Thus, arrays provide a further
tool for the toxicologist investigating the
nature of susceptible subpopulations and toxicologic
response.
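
The sequence deduction described above, in which each position is represented by probes carrying all four possible bases, can be sketched as a simple strongest-signal call. This is a simplified assumption and not the vendor's actual algorithm; the ambiguity rule (the top signal must be at least twice the runner-up) and all signal values are hypothetical.

```python
def call_base(intensities, min_ratio=2.0):
    """Call the base whose probe hybridized most strongly at one position.

    intensities maps each base to the signal of its probe. "N" is
    returned when the strongest signal is not at least min_ratio times
    the second strongest (an assumed ambiguity rule).
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if second > 0 and top / second < min_ratio:
        return "N"
    return best


def call_sequence(positions):
    """Join per-position base calls into a sequence string."""
    return "".join(call_base(p) for p in positions)


# Hypothetical probe signals for four consecutive positions.
probe_signals = [
    {"A": 950, "C": 80, "G": 60, "T": 70},
    {"A": 40, "C": 55, "G": 1200, "T": 90},
    {"A": 500, "C": 30, "G": 450, "T": 20},  # two strong signals: ambiguous
    {"A": 35, "C": 25, "G": 50, "T": 880},
]
sequence = call_sequence(probe_signals)
print(sequence)
```

A position with two comparably strong signals, as in the third entry, is the kind of result that might indicate a heterozygous SNP and would need follow-up.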

There are still many wrinkles to be ironed
out before arrays become a standard tool for
toxicologists. The main issues raised at the
workshop by those with hands-on experience
were the following:

• Expense: the cost of purchasing/contracting
this technology is still too great for many
individual laboratories.










Figure 2. Potential effects of gene knockout within
positively and negatively regulated gene expression
networks. t : is limiting in wild type for expression of
u (A) A simple, two-component, linear regulatory
network operating on gene ^ where / t is a positive
effector of ^ and j n is either a positive or negative
effector of i y. This network could be deduced by
examining the consequence of (B) deleting j n on the
expression of /, and ^ where the expression of L
would be decreased or increased depending on
whether j n was a positive or negative regulator.
These and other connected components of even
greater complexity could be revealed by genome-wide
expression analysis. From Butow (15).



• Clones: the logistics of identifying, obtaining,
and maintaining a set of nonredundant, noncontaminated,
sequence-verified, species/cell/
tissue/field-specific clones.

• Use of inbred strains: where whole-organism
models are being used, the use of inbred
strains is important to reduce the potentially
confusing effects of the individual variation
typically seen in outbred populations.

• Probe: the need for relatively large amounts
of RNA, which limits the type of sample
(e.g., biopsy) that can be used. Also, different
RNA extraction methods can give different
results.

• Specificity: the ability to discriminate accurately
between closely related genes (e.g., the
cytochrome P450 family) and splice variants.

• Quantitation: the quantitation of gene
expression using gene arrays is still open to
debate. One reason for this is the different
incorporation of the labeling dyes. However,
the main difficulty lies in knowing what to
normalize against. One option is to include a
large number of so-called housekeeping genes
in the array. However, the expression of these
genes often changes depending on the tissue
and the toxicant, so it is necessary to characterize
the expression of these genes in the
model system before utilizing them. This is
clearly not a viable option when screening
multiple new compounds. A second option
is to include on the array genes from a nonrelated
species (e.g., a plant gene on an animal
array) and to spike the probe with synthetic
RNA(s) complementary to the gene(s).

• Reproducibility: this is sometimes questionable,
and a figure of approximately two or
three repeats was used as the minimum number
required to confirm initial findings.






Again, hour. sr. must re.r.. _ 
use of Northern blots or reverse transcriptase
PCR to confirm findings.

• Sensitivity: concerns were voiced about the
number of target molecules that must be present
in a sample for them to be detected on
the array.

• Efficiency: reproducible identification of 1.5-
to 2-fold differences in expression was reported,
although the number of genes that
undergo this level of change and remain
undetected is open to debate. It is important
that this level of detection be ultimately
achieved because it is commonly perceived
that some important transcription factors
and their regulators respond at such low levels.
In most cases, 3- to 5-fold was the minimum
change that most were happy to
accept.

• Bioinformatics: perhaps the greatest concern
was how to interpret the data with
the greatest accuracy and efficiency. The
biggest headache is trying to identify networks
of gene expression that are common to
different treatments or doses. The amount of
data from a single experiment is huge. It may
be that, in the future, several groups individually
equipped with specialized software algorithms
for studying their favorite genes or
gene systems will be able to share the same
hybridized chips. Thus, arrays could usher in
a new perspective on collaboration and the
sharing of data.
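
The quantitation issue in the list above, normalizing against a spiked control from a nonrelated species and then applying a minimum fold change, can be sketched as follows. This is an illustrative assumption rather than a protocol reported at the workshop; the gene names, signal values, and the 3-fold cutoff used here are hypothetical choices.

```python
def normalize_to_spike(signals, spike_ids):
    """Scale signals so the mean of the spike-in control spots equals 1.0."""
    spike_mean = sum(signals[s] for s in spike_ids) / len(spike_ids)
    return {gene: value / spike_mean for gene, value in signals.items()}


def changed_genes(control, treated, spike_ids, min_fold=3.0):
    """Return {gene: fold} for genes changing at least min_fold either way."""
    c = normalize_to_spike(control, spike_ids)
    t = normalize_to_spike(treated, spike_ids)
    calls = {}
    for gene in c:
        if gene in spike_ids:
            continue  # the spike itself is a reference, not a result
        fold = t[gene] / c[gene]
        if fold >= min_fold or fold <= 1.0 / min_fold:
            calls[gene] = fold
    return calls


# Hypothetical raw signals; the treated array is twice as bright overall,
# which the plant spike-in control corrects for.
control = {"plant_spike": 1000.0, "geneA": 500.0, "geneB": 800.0, "geneC": 300.0}
treated = {"plant_spike": 2000.0, "geneA": 4000.0, "geneB": 2400.0, "geneC": 150.0}
calls = changed_genes(control, treated, ["plant_spike"])
print(sorted(calls))
```

Without the spike correction geneB would appear 3-fold induced; after it, the change is only 1.5-fold and falls below the cutoff, while geneA (4-fold up) and geneC (4-fold down) are retained.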

EPAMAC 

Perhaps the main reason most scientists are
unable to use array technology is the high cost
involved, whether buying off-the-shelf membranes,
using contract printing services, or



[Figure 3 image not reproduced. Panel labels: Toxicant family; Endocrine disruption; Oxidant stressors; Peroxisome proliferators; Polycyclic aromatic hydrocarbons.]

Figure 3. Gene expression profiles, also called fingerprints or signatures, of known toxicants or toxicant
families may, in the future, be used to identify the potential toxicity of new drugs, etc. In this example,
the genetic signature of test compound 1 is identical to that of known peroxisome proliferators,
whereas that of test compound 2 does not match any known toxicant family. Based on these results, test
compound 2 would be retained for further testing and test compound 1 would be eliminated.
producing chips in-house. In view of this,
researchers at the RTD/NHEERL initiated
the EPAMAC. This consortium brings
together scientists from the EPA and a number
of extramural labs with the aim of developing
microarray capability through the sharing
of resources and data. EPAMAC
researchers are primarily interested in the
developmental and toxicologic changes seen
in testicular and breast tissue, and a portion
of the workshop was set aside for EPAMAC
members to share their ideas on how the
experimental application of microarrays could
facilitate their research. One of the central
areas of interest to EPAMAC members is the
effect of xenobiotics on male fertility and
reproductive health. Of greatest concern is
the effect of exposure during critical periods
of development and germ cell differentiation
(9), and how this may compromise sperm
counts and quality following sexual maturation
(10). As well as spermatogenic tissue,
there is also interest in how residual mRNA
found in mature sperm (11) could be used as
an indicator of previous xenobiotic effects (it
is easier to obtain a semen sample than a testicular
biopsy). Arrays will be used to examine
and compare the effect of exposure to heat
and chemicals in testicular and epididymal
gene expression profiles, with the aim of
establishing relationships/associations
between changes in developmental landmarks
and the effects on sperm count and quality.
Cluster, pattern, and other analysis of such
data should help identify hidden relationships
between genes that may reveal potential
mechanisms of action and uncover roles for
genes with unknown functions.

Summary 

The full impact of DNA arrays may not be
seen for several years, but the interest shown at
this regional workshop indicates the high level
of interest that they foster. Apart from educating
and advertising the various technologies in
this field, this workshop brought together a
number of researchers from the Research
Triangle Park area who are already using DNA
arrays. The interest in sharing ideas and experiences
led to the initiation of a Triangle array
user's group.



Array technology is still in its infancy. This
means that the hardware is still improving and
there is no current consensus for standard procedures,
quantitation, and interpretation.
Consistency in spotting and scanning arrays is
not yet optimized, and this is one of the most
critical requirements of any experiment. In
addition, one of the dark regions of array technology,
strife in the courts over who owns
what portions of it, has further muddled the
future and is a potential barrier toward the
development of consensus procedures.

Perhaps the greatest hurdle for the application
of arrays is the actual interpretation of
data. No specialists in bioinformatics attended
the workshop, largely because they are rare and
because as yet no one seems clear on the best
method of approaching data analysis and interpretation.
Cross-referencing results from multiple
experiments (time, dose, repeats, different
animals, different species) to identify commonly
expressed genes is a great challenge. In most
cases, we are still a long way from understanding
how the expression of gene X is related to
the expression of gene Y, and ordering gene
expression to delineate causal relationships.

To the ordinary scientist in the typical lab- 
oratory, however, the most immediate prob- 
lem is a lack of affordable instrumentation. 
One can purchase premade membranes at 
relatively affordable prices. Although these
may be useful in identifying individual genes
to pursue in more detail using other methods,
the numbers that would be required for even a
small routine toxicology experiment prohibit
this as a truly viable approach. For the toxicologist,
there is a need to carry out multiple
experiments: dose responses, time curves,
multiple animals, and repeats. Glass-based
DNA arrays are most attractive in this context
because they can be prepared in large batches
from the same DNA source and accommodate
control and treated samples on the same
chip. Another problem with current off-the-shelf
arrays is that they often do not contain
one or more of the particular genes a group is
interested in. One alternative is to obtain
and/or produce a set of custom clones and
have contract printing of membranes or slides
carried out by a company such as Genomic
Solutions, Inc. (Ann Arbor, MI). This approach
is less expensive than setting up
one's own entire system, although at some
point it might make economic sense to print
one's own arrays.

Finally, DNA arrays are currently a team
effort. They are a technology that uses a wide
range of skills including engineering, statistics,
molecular biology, chemistry, and bioinformatics.
Because most individuals are skilled in
only one or perhaps two of these areas, it
appears that success with arrays may be best
expected by teams of collaborators consisting
of individuals having each of these skills.

Those considering array applications may
be amused or goaded on by the following
quote from Fortune magazine (12):

Microprocessors have reshaped our economy,
spawned vast fortunes and changed the way we live.
Gene chips could be even bigger.

Although this comment may have been
designed to excite the imagination rather than
accurately reflect the truth, it is fair to say that
the age of functional genomics is upon us.
DNA arrays look set to be an important tool in
this new age of biotechnology and will likely
contribute answers to some of toxicology's
most fundamental questions.

Subject: RE: [Fwd: Toxicology Chip]
Date: Mon, 3 Jul 2000 08:09:45 -0400
From: "Afshari, Cynthia" <afshari@niehs.nih.gov>
To: "Diana Hamlet-Cox" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at

hctp: r.ar.ue. . r.:er.s . r.ir. . go-- maps g-uest cl cr.esrcr. cfr

We selected a subset of genes (2000K) that we believed critical to
response and basic cellular processes and added a set of clones and ESTs to
this. We have included a set of control genes (80+) that were selected by
the NHGRI because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at the
variation of each of these 80+ genes across our experiments.

Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.

I hope this answers your question.

Cindy Afshari



> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences,
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, etc.
>
> Diana Hamlet-Cox
>
> Original Message
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> 29, 2000 regarding the work of the NIEHS in this area. I would like to
> know if there is a resource I can access (or you could provide?) that
> would give me a list of the 12,000 genes that are on your Human ToxChip
> Microarray. In particular, I am interested in the criteria used to
> select sequences for the ToxChip, including any control sequences
> included in the microarray.



07/31/7000 10 M AM 



> This e-mail message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information subject to
> attorney-client privilege. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.



07/31/2000 \0 M AM 
The Relationship between Protein Structure and
Function: a Comprehensive Survey with Application
to the Yeast Genome

Hedi Hegyi and Mark Gerstein*

Department of Molecular Biophysics & Biochemistry, Yale University, 266 Whitney
Avenue, PO Box 208114, New Haven, CT, 06520 USA

For most proteins in the genome databases, function is predicted via
sequence comparison. In spite of the popularity of this approach, the extent
to which it can be reliably applied is unknown. We address this issue by
systematically investigating the relationship between protein function and
structure. We focus initially on enzymes functionally classified by the
Enzyme Commission (EC) and relate these to structurally classified
domains in the SCOP database. We find that the major SCOP fold classes
have different propensities to carry out certain broad categories of functions.
For instance, alpha/beta folds are disproportionately associated with
enzymes, especially transferases and hydrolases, and all-alpha and small
folds with non-enzymes, while alpha + beta folds have an equal tendency
either way. These observations for the database overall are largely true for
specific genomes. We focus, in particular, on yeast, analyzing it with many
classifications in addition to SCOP and EC (i.e. COGs, CATH, MIPS), and
find clear tendencies for fold-function association, across a broad spectrum
of functions. Analysis with the COGs scheme also suggests that the functions
of the most ancient proteins are more evenly distributed among
different structural classes than those of more modern ones. For the database
overall, we identify the most versatile functions, i.e. those that are
associated with the most folds, and the most versatile folds, associated
with the most functions. The two most versatile enzymatic functions
(hydro-lyases and O-glycosyl glucosidases) are associated with seven folds
each. The five most versatile folds (TIM-barrel, Rossmann, ferredoxin,
alpha-beta hydrolase, and P-loop NTP hydrolase) are all mixed alpha-beta
structures. They stand out as generic scaffolds, accommodating from six to
as many as 16 functions (for the exceptional TIM-barrel). At the conclusion
of our analysis we are able to construct a graph giving the chance that a
functional annotation can be reliably transferred at different degrees of
sequence and structural similarity. Supplemental information is available
from http://bioinfo.mbb.yale.edu/genome/foldfunc.

© Academic Press

Keywords: structure-function; fold classification; structural convergence;
functional divergence; yeast genomics

*Corresponding author



Introduction 

The problem of determining function 
from sequence 



ducts in a genome. However, the function of only 
a minor fraction of proteins has been studied 
experimentally, and, typically, prediction of func- 
tion is based on sequence similarity with proteins 
of known function. That is, functional annotation 
is transferred based on similarity. Unfortunately,
Figure 1. Specific example of convergent and divergent evolution. Top, an
example of convergent evolution, showing structures of two carbonic
anhydrases with the same enzymatic function (EC number 4.2.1.1), but with
different folds. The Figure was drawn with Molscript (Kraulis, 1991) from 1THJ
(left-handed beta helix) and 1DMX (flat beta sheet). Bottom, an example of
possible divergent evolution, the TIM-barrel. This fold functions as a generic
scaffold catalyzing 15 different enzymatic functions. A schematic Figure of the
TIM-barrel fold is shown with numbers in boxes indicating the different
location of the active site in four proteins that have this fold. These four
proteins, xylose isomerase, aldose reductase, enolase, and adenosine
deaminase, carry out very different enzymatic functions, in four of the main
EC classes (1.*.*, 3.*.*, 4.*.*, and 5.*.*). They have active sites at very
different locations (identified by the boxed numbers in the barrel) yet they
all share the same fold.



progressively corrupt genome databases through 
the problem of accumulating incorrect annota- 
tions and using them as a basis for further anno- 
tations, and so on. 

It is known that sequence similarity does confer 
structural similarity. Moreover, there is a well- 
established quantified relationship between the 
extent of similarity in sequence and that in struc- 
ture. First investigated by Chothia & Lesk (1986) 
the similarity between the structures of two pro- 
teins (in terms of RMS) appears to be a monotonic 
function of their sequence similarity. This fact is 
often exploited when two sequences are declared 
related, based on a database search by programs 
such as BLAST or FastA (Altschul et al., 1997;
Pearson, 1996). Often, the only common element in 
two distantly related protein sequences is their 
underlying structures, or folds. 

Transitivity requires that the well-established 
relationship between sequence and structure, and 
the more indefinite one between sequence and 
function, imply an indefinite relationship between 
structure and function. Several recent papers have 
highlighted this, analyzing individual protein 
superfamilies with a single fold but diverse func- 
tions. Examples include the aldo-keto reductases, a 
large hydrolase superfamily, and the thiol protein



esterases. The latter include the eye-lens and corneal
crystallins, a remarkable example of functional
divergence (Bork & Eisenberg, 1998; Bork et al.,
1994; Cooper et al., 1993; Koonin & Tatusov, 1994;
Seery et al., 1998).

There are also many classic examples of the con- 
verse: the same function achieved by proteins with 
completely different folds. For instance, even 
though mammalian chymotrypsin and bacterial 
subtilisin have different folds, they both function 
as serine proteases and have the same Ser-Asp-His 
catalytic triad. Other examples include sugar 
kinases, anti-freeze glycoproteins, and lysyl-tRNA
synthetases (Bork et al., 1993; Chen et al., 1997;
Doolittle, 1994; Ibba et al., 1997a,b).

Figure 1 shows well-known examples of each of 
these two basic situations: the same fold but differ- 
ent function (divergent evolution) and the same 
function but different fold (convergent evolution). 

Protein classification systems 

The rapid growth in the number of protein 
sequences and three-dimensional structures has 
made it practical and advantageous to classify pro- 
teins into families and more elaborate hierarchical 
systems. Proteins are grouped together on the 
basis of structural similarities in the FSSP (Holm &
Sander, 1998), CATH (Orengo et al., 1997), and
SCOP databases (Murzin et al., 1995). SCOP is
based on the judgments of a human expert, FSSP
on automatic methods, and CATH on a mixture of
both. Other databases collect proteins on the basis
of sequence similarities to one another, e.g. PROSITE,
SBASE, Pfam, BLOCKS, PRINTS and ProDom
(Attwood et al., 1998; Bairoch et al., 1997;
Corpet et al., 1998; Fabian et al., 1997; Henikoff
et al., 1998; Sonnhammer et al., 1997). Several collections
contain information about proteins from a
functional point of view. Some of these focus on
particular organisms, e.g. the MIPS functional catalogue
and YPD for yeast (Mewes et al., 1997;
Hodges et al., 1998) and EcoCyc and GenProtEC
for Escherichia coli (Karp et al., 1998; Riley, 1997).
Others focus on particular functional aspects in
multiple organisms, e.g. the WIT and KEGG
databases, which focus on metabolism and pathways
(Selkov et al., 1997; Ogata et al., 1999), the
ENZYME database, which focuses, obviously
enough, on enzymes (Bairoch, 1996), and the
COGs system, which focuses on proteins conserved
over phylogenetically distinct species
(Tatusov et al., 1997). The ENZYME database, in
particular, contains all the enzyme reactions that
have an Enzyme Commission (EC) number
assigned in accordance with the International
Nomenclature Committee and is cross-referenced
with Swissprot (Bairoch, 1996; Bairoch &
Apweiler, 1998; Barrett, 1997).

Our approach: systematic comparison of 
proteins classified by structure with those 
classified by function 

One of the most valuable operations one can 
do to these individual classification systems is to 
cross-reference and cross-tabulate them, seeing 
how they overlap. We performed such an anal- 
ysis here by systematically interrelating the 
SCOP, Swissprot and ENZYME databases 
(Bairoch, 1996; Bairoch & Apweiler, 1998; Murzin 
et al., 1995). For yeast we also have used the
MIPS yeast functional catalogue, CATH and
COGs in our analysis. This enables us to investigate
the relationship between protein function
and structure in a comprehensive statistical
fashion. In particular, we investigated the functional
aspects of both divergent and convergent
evolution, exploring cases where a structure gains
a dramatically different biochemical function and
finding instances of similar enzymatic functions
performed by unrelated structures.

We concentrated on single-domain Swissprot 



Recent related work 

This work is following up on several recent
reports on the relationship between protein structure
and function. In particular, Martin et al. (1998)
studied the relationship between enzyme function
and the CATH fold classification. They concluded
that functional class (expressed by top-level EC
numbers) is not related to fold, since a few specific
residues, not the whole fold, determine enzyme
function. Russell (1998) also focused on specific
side-chain patterns, arguing that these could be
used to predict protein function. In a similar
fashion, Russell et al. (1998) identified structurally
similar "supersites" in superfolds. They estimated
that the proportion of homologues with different
binding sites, and therefore with different functions,
is around 10%. In a novel approach, using
machine learning techniques, des Jardins et al.
(1997) predict purely from the sequence whether a
given protein is an enzyme and also the enzyme
class to which it belongs.

Our work is also motivated by recent work looking
at whether or not organisms are characterized
by unique protein folds (Frishman & Mewes, 1997;
Gerstein, 1997, 1998a,b; Gerstein & Hegyi, 1998;
Gerstein & Levitt, 1997). If function is closely
associated with fold (in a one-to-one sense), one
would think that when a new function arose in
evolution, nature would have to invent a new fold.
Conversely, if fold and function are only weakly
coupled, one would expect to see a more uniform
distribution of folds amongst organisms and a
high incidence of convergent evolution. In fact, a
recent study on microbial genome analysis claims
that functional convergence is quite common
(Koonin & Galperin, 1997). Another related paper
systematically searched Swissprot for all such cases
of what is termed "analogous" enzymes (Galperin
et al., 1998).

Our work is also motivated by the recent work
on protein design and engineering which aims to
rationally change a protein function, for instance,
to engineer a reporter function into a binding protein
(Hellinga, 1997, 1998; Marvin et al., 1997).



Results 

Overview of the 8937 single-domain matches 

Our basic results were based on simple sequence 
comparisons between Swissprot and SCOP, the 
SCOP domain sequences being used as queries 
against Swissprot. We focused on "mono-functional"
single-domain matches in Swissprot, i.e.
those single-domain proteins with only one anno-
teins in Swissprot, 19,995 are enzymes, 18,317 are
structural homologues, and 8205 are both.) About
half of the fraction of Swissprot that matched
known structures were "single-domain" and about
one-third of these were enzymes (8937 and 3359,
respectively, of 18,317). We focus on these 8937
single-domain matches here. Notice how these
numbers also show how the known structures are
significantly biased towards enzymes: 45% (8205
out of 18,317) of all the structural homologues are
enzymes versus 29% (19,995 out of 69,113) for all
of Swissprot.
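
The bias towards enzymes can be checked directly from the counts quoted in this paragraph. The following is only a minimal sketch; the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the text (Swissprot 35 compared against SCOP 1.35).
SWISSPROT_TOTAL = 69_113        # all Swissprot proteins
ENZYMES = 19_995                # Swissprot proteins annotated as enzymes
STRUCTURAL_HOMOLOGUES = 18_317  # Swissprot proteins matching a known structure
BOTH = 8_205                    # enzymes that are also structural homologues


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of enzymes in the whole database versus among structural homologues.
enzyme_share_overall = percent(ENZYMES, SWISSPROT_TOTAL)
enzyme_share_with_structure = percent(BOTH, STRUCTURAL_HOMOLOGUES)

print(f"enzymes in all of Swissprot:         {enzyme_share_overall:.0f}%")
print(f"enzymes among structural homologues: {enzyme_share_with_structure:.0f}%")
```

The two printed percentages correspond to the 29% and 45% figures given in the text.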

331 observed fold-function combinations 

Figure 2 gives an overview of how the matches 
are distributed amongst specific functions and 
folds. The single-domain matches include 229 of 
the 361 folds in SCOP 1.35, and 91 of the 207 three- 
component enzyme categories in the ENZYME 
database (Bairoch, 1996). Each match combines a 
SCOP fold number on the structural side (columns 
in Figure 2) and a three-component EC category on 
the functional side (rows), with all the non-enzymatic
functions grouped together into a single category
with the artificial "EC number" of 0.0.0
(shown in the first row in Figure 2). This results in
a table where each cell represents a potential fold- 
function combination. The table contains a maxi- 



mum of 21,068 (=229 x 92) possible fold-function 
combinations (and a minimum of 229 combi- 
nations, assuming only one function for every 
fold). We actually observe 331 of these combi- 
nations (1.6%, shown by the filled-in cells). 
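
As a rough illustration of how such a fold-function table is counted, the sketch below treats every match as a (fold, function) pair and counts distinct pairs. The toy match list and the helper name `distinct_combinations` are hypothetical; only the three totals come from the text.

```python
from typing import Iterable, Set, Tuple

# Totals quoted in the text.
N_FOLDS = 229      # SCOP folds with at least one single-domain match
N_FUNCTIONS = 92   # 91 enzyme categories plus the artificial 0.0.0 category
N_OBSERVED = 331   # filled cells in the fold-function table


def distinct_combinations(matches: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Collapse many (fold, function) matches into the set of distinct pairs.

    A cell of the table counts as filled as soon as one match exists,
    no matter how many further matches fall into the same cell.
    """
    return set(matches)


# Size of the full table and the fraction of cells that are filled.
possible = N_FOLDS * N_FUNCTIONS
occupancy_percent = 100.0 * N_OBSERVED / possible

# Hypothetical toy input: four matches that fill only three distinct cells.
toy_matches = [
    ("fold_1", "0.0.0"),
    ("fold_1", "0.0.0"),
    ("fold_1", "3.2.1"),
    ("fold_2", "3.2.1"),
]
toy_combinations = distinct_combinations(toy_matches)

print(possible, f"{occupancy_percent:.1f}%", len(toy_combinations))
```

Counting distinct pairs rather than raw matches is what makes the statistic less sensitive to heavily populated protein families.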

Overall, more than half of the functions are
associated with at least two different folds, while
less than half of the folds with enzymatic activity
have at least two functions (51 out of 91 and 53 out
of 128, respectively).

Summarizing the fold-function combinations 
by 42 broad structure-function classes 

As listed in Table 1, folds can be subdivided in 
six broad fold classes (e.g. all-alpha, all-beta, 
alpha/beta, etc.). Likewise, functions can be bro- 
ken into seven main classes, non-enzymes plus six 
enzyme classes, e.g. oxidoreductase, transferase, 
etc. This gives rise to 42 (6 × 7) structure-function
classes. The way the 21,068 potential fold-function
combinations are apportioned amongst the 42 
classes is shown in Table 2A. 
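
The apportioning of potential combinations can be sketched as an outer product: each of the 42 classes holds (matched folds in the fold class) × (matched functions in the functional class) cells. The per-class counts below are an assumption, inferred from the non-enzyme row of Table 2A and from dividing the other rows by it; they are not stated explicitly in the text.

```python
# Matched folds per fold class (assumed: read off the NONENZ row of Table 2A).
FOLDS_PER_CLASS = {"A": 46, "B": 36, "A/B": 48, "A+B": 56, "MULTI": 15, "SML": 28}

# Matched functions per functional class (assumed: each Table 2A row divided
# by the corresponding number of folds).
FUNCTIONS_PER_CLASS = {
    "NONENZ": 1, "OX": 24, "TRAN": 13, "HYD": 29, "LY": 9, "ISO": 10, "LIG": 6,
}

# Potential combinations in each structure-function class.
possible_cells = {
    func: {fold: n_func * n_fold for fold, n_fold in FOLDS_PER_CLASS.items()}
    for func, n_func in FUNCTIONS_PER_CLASS.items()
}

n_classes = sum(len(row) for row in possible_cells.values())
grand_total = sum(sum(row.values()) for row in possible_cells.values())

print(n_classes, grand_total, possible_cells["OX"]["A"])
```

Under these assumed counts the grand total equals 229 × 92, the size of the full fold-function table.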

Table 2B shows the way the 331 observed combi- 
nations were actually distributed amongst the 42 
classes. Comparing the number of possible combi- 
nations with that observed shows that the most 
densely populated region of the chart is the trans- 
ferase, hydrolase and lyase functions in combi- 
Figure 2. Overview of all the single-domain matches between proteins in Swissprot 35 and domains in SCOP 1.35.
Sequences were compared with BLAST using the match criteria described in Materials and Methods. The matches are
clustered into 92 functions (based on three-component EC numbers), which are arranged on each row, and 229 folds
(based on SCOP fold numbers), which are arranged on each column. The first row indicates the matches with non-enzymes.
There are, thus, 21,068 (= 92 × 229) possible combinations shown in the Figure. Only 331 are actually
observed. These are indicated by filled squares.
Table 1. Broad structural and functional categories

A. Functional categories in Swissprot 35

EC category   Category name     Abbreviation   Num. of functions in category
0.0.0         Non-enzyme        NONENZ           1
1.*.*         Oxidoreductases   OX              86
2.*.*         Transferases      TRAN            28
3.*.*         Hydrolases        HYD             53
4.*.*         Lyases            LY              15
5.*.*         Isomerases        ISO             16
6.*.*         Ligases           LIG              9
                                Total:         208

B. Structural classes in SCOP 1.35

Fold class    Class name        Abbreviation   Num. of folds in class
1             All-alpha         A               81
2             All-beta          B               57
3             Alpha and beta    A/B             70
4             Alpha plus beta   A + B           91
5             Multi-domain      MULTI           19
6             Transmembrane     TM               9
7             Small proteins    SML             43
                                Total:         361

A. List of the functional (enzymatic) categories in Swissprot and the abbreviations used here. The values denote the number of
functions in each category.
B. List of the structural classes and the abbreviations used for the classes. Values denote the number of folds
in each class in SCOP 1.35. Class 6 is not used in the analysis.



nation with the alpha/beta fold class. This notion 
is in accordance with the general view that the 
most popular structures among enzymes fall into 
the alpha /beta class. In contrast, matches between 
small folds and enzymes are almost completely 
missing, except for five folds in the oxidoreductase 
category. There are also no all-alpha ligases and 
only one all-alpha isomerase. 

Table 2C and D break down the 331 fold-func- 
tion combinations in Table 2A into either just a 
number of folds or just a number of functions. 
That is, Table 2C lists the number of different folds 
associated with each of the 42 structure-function 
classes (corresponding to the non-zero columns in 
the relevant class in Figure 2), and Table 2D does 
the same thing for functions (non-zero rows in 
Figure 2). Comparing these tables back to the total 
number of combinations (Table 2A) reveals some 
interesting findings, keeping in mind that more
functions than folds reveals probable divergence
and that more folds than functions reveals probable
convergence. For instance, the alpha/beta and
alpha + beta fold classes contain similar numbers
of folds, but the alpha/beta class has relatively
more functions, perhaps reflecting a greater divergence.
(Specifically, the alpha/beta class has 73
folds and 56 functions, while the alpha + beta class
has 67 folds but only 35 functions.)

Table 2E shows the number of matching Swis-



obviously, affected by the biases in Swissprot. On 
the other hand, if we compare the total matches in 
Table 2E with the total combinations in Table 2B it 
is clear that the numbers do not directly correlate. 
For instance, fewer hydrolases in Swissprot have 
matches with alpha /beta folds than with alpha 
+ beta folds (295 versus 452), but the number of 
different combinations in the first case is 30, as 
opposed to only 18 in the second case. This 
suggests that our approach of counting combi- 
nations may not be as affected by the biases in the 
databanks as simply counting matches. 

Table 2F and G give some rough indication of 
the statistical significance of the differences in 
the observed distribution of combinations. In 
Table 2F, using chi-squared statistics, we calculate 
for each individual structure class the chance that 
we could get the observed distribution of fold- 
function combinations over various functional 
classes if fold were not related to function. Then
in Table 2G, we reverse the role of fold and 
function, and calculate the statistics for each 
functional class. 

Enzyme versus non-enzyme folds 

On the coarsest level, function can be divided
amongst enzymes and non-enzymes. Of the 229
folds present in Figure 2, 93 are associated only
with enzymes, and 101 are associated only with
broadest functional categories. The distribution is
far from uniform. The all-alpha fold class has 30
non-enzymatic representatives, but only 12 purely
enzymatic folds and four folds with "mixed" (both
types of) functions. This implies that a protein with
an all-alpha fold has, a priori, roughly twice the
chance of having a non-enzymatic function over an
enzymatic one. The all-beta fold class has six enzymatic,
17 non-enzymatic and 13 mixed folds. In the
alpha/beta class, 34 folds are associated only with
enzymes and five folds only with non-enzymes,
whereas in the alpha + beta class this ratio is more
balanced: 28 "purely" enzymatic folds versus 22
purely non-enzymatic ones.



Table 2. Statistics over 42 structure-function classes

A. Number of possible combinations between folds and functions in each of 42 classes (number of cells in Figure 2)

              A       B     A/B   A + B   MULTI     SML      Sum
NONENZ       46      36      48      56      15      28      229
OX         1104     864    1152    1344     360     672     5496
TRAN        598     468     624     728     195     364     2977
HYD        1334    1044    1392    1624     435     812     6641
LY          414     324     432     504     135     252     2061
ISO         460     360     480     560     150     280     2290
LIG         276     216     288     336      90     168     1374
Sum        4232    3312    4416    5152    1380    2576   21,068

B. Number of observed combinations between folds and functions in each of 42 classes (number of filled cells in Figure 2)

              A       B     A/B   A + B   MULTI     SML      Sum
NONENZ       34      30      14      28       4      26      136
OX           13       5      17       3       4       5       47
TRAN          3       3      16       8       5       -       35
HYD           4      11      30      18       4       -       67
LY            2       3      13       5       -       -       23
ISO           1       2       7       4       2       -       16
LIG           -       1       2       3       1       -        7
Sum          57      55      99      69      20      31      331

C. Number of folds in each of the 42 classes (columns with a filled cell in Figure 2)

              A       B     A/B   A + B   MULTI     SML      Sum
NONENZ       34      30      14      28       4      26      136
OX            7       5       9       3       3       3       30
TRAN          3       2      15       6       5       -       31
HYD           4       8      19      18       3       -       52
LY            2       3       8       5       -       -       18
ISO           1       2       7       4       2       -       16
LIG           -       1       1       3       1       -        6
Sum          51      51      73      67      18      29      289

D. Number of functions in each of the 42 classes (rows with a filled cell in Figure 2)

              A       B     A/B   A + B   MULTI     SML      Sum
NONENZ        1       1       1       1       1       1        6
OX            8       5       9       3       3       5       33
TRAN          2       3      13       8       4       -       30
HYD           4       7      19      14       4       -       48
LY            2       2       7       3       -       -       14
ISO           1       2       5       4       1       -       13
LIG           -       1       2       2       1       -        6
Sum          18      21      56      35      14       6      150

E. Total number of matching Swissprot sequences in each of the 42 fold-function classes

              A       B     A/B   A + B   MULTI     SML      Sum
NONENZ     1940    1159     560     638     106     892     5295
OX          150     202     388      50      68      18      876
TRAN         65      14     363     116     174       -      732
HYD         116     394     295     452      92       -     1349
LY           40      47     168     104       -       -      359
ISO           2      54     122      22       2       -      202
LIG           -       5      26      69      24       -      124
Sum        2313    1875    1922    1451     466     910     8937

F. How much does each of the fold classes deviate from the average distribution of functions?

           chi-squared    P
A          [illegible]    <0.01
B          [illegible]    <0.6
A/B        32.5           <0.00002
A + B      [illegible]    [illegible]
MULTI      [illegible]    <0.2
SML        27.8           <0.0002

G. How much does each of the function classes deviate from the average distribution of folds?

           chi-squared    P
NONENZ     40.7           <0.0000002
OX         9.9            <0.08
TRAN       13.1           <0.03
HYD        17.3           [illegible]
LY         10.2           <0.08
ISO        [illegible]    [illegible]
This Table shows various totals from Figure 2 distributed among the 42 structure-function classes, i.e. the seven functional categories
in Table 1A multiplied by the six structural categories in Table 1B. Part A shows how many potential fold-function combinations
there are in Figure 2 amongst each of the 42 classes. Part B shows how many of these 21,068 possible combinations are
actually observed. Part C shows the total number of different folds (i.e. selected columns in Figure 2) in each class. Part D shows the
total number of different functions (i.e. selected rows in Figure 2) in each class. Part E shows the total number of matching Swissprot
proteins in the 42 classes. Note that to observe a fold-function combination one only needs the existence of a single match between a
Swissprot protein and a SCOP domain. However, there can be many more. That is why the totals in this Table sum up to so much
larger an amount than 331.

Here is an example of how to read parts A to E of the Table, focussing on the all-alpha, oxidoreductase region. Part A shows that
there are 1104 cells, filled or unfilled, in this region, corresponding to possible combinations. Part B shows that 13 of these 1104 cells
are filled, corresponding to observed all-alpha, oxidoreductase combinations. Part C shows that there are seven folds, corresponding
to columns with filled cells in this region. Part D shows that there are eight functions, corresponding to rows with filled cells in this
region. Finally, in part E we find that there are 150 Swissprot entries that have matches with a SCOP domain. They correspond to
the 13 observed combinations in Part B.

Parts F and G give information on the statistical significance of the differences observed between the 42 structure-function classes.
Part F gives the significance that the observed distribution of fold-function combinations in a given functional class is different than
average (i.e. the null hypothesis that distribution of fold-function combinations is the same in each functional class). This is very
similar to the derivation by Martin et al. (1998). A chi-squared statistic is computed for each of the seven functional classes in the
conventional way: $\chi^2(f) = \sum_s (O_{sf} - E_{sf})^2 / E_{sf}$, where for a given functional class $f$ and structure class $s$, $O_{sf}$ is the observed number of
fold-function combinations and $E_{sf}$ is the expected number. $E_{sf}$ is simply computed from scaling the "sum" column and row in Part
B of the Table: $E_{sf} = T_s T_f / T$, where $T_s$ is the total number of combinations in a given structural class $s$ (sum row), $T_f$ is the total number
of combinations in a given functional class $f$ (sum column), and $T$ is the total observed number of combinations, 331. Part G
gives the statistical significance that the observed distribution of fold-function combinations in a given structural class is different
than average. To compute this one simply sums over functions instead of structures: $\chi^2(s) = \sum_f (O_{sf} - E_{sf})^2 / E_{sf}$. After each chi-squared
statistic is reported, a rough probability or P-value is given. This gives the chance the observed distribution could be obtained randomly.
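
The chi-squared calculation described in this caption can be sketched in a few lines. The observed counts below are the Part B values as best they can be read from the scan; a handful of illegible cells were inferred from the row and column sums, so the matrix should be treated as an assumption rather than as the authors' data file. The function name is likewise illustrative.

```python
FOLD_CLASSES = ["A", "B", "A/B", "A+B", "MULTI", "SML"]

# Observed fold-function combinations (Table 2B), one row per functional class.
# Assumption: cells that are illegible in the scan were filled in from the sums.
OBSERVED = {
    "NONENZ": [34, 30, 14, 28, 4, 26],
    "OX":     [13, 5, 17, 3, 4, 5],
    "TRAN":   [3, 3, 16, 8, 5, 0],
    "HYD":    [4, 11, 30, 18, 4, 0],
    "LY":     [2, 3, 13, 5, 0, 0],
    "ISO":    [1, 2, 7, 4, 2, 0],
    "LIG":    [0, 1, 2, 3, 1, 0],
}


def chi_squared_by_fold_class(observed, fold_classes):
    """Chi-squared statistic for each structural class, summed over functions.

    The expected count for a cell is E_sf = T_s * T_f / T, where T_s is the
    column (structural class) total, T_f the row (functional class) total,
    and T the grand total of observed combinations.
    """
    grand_total = sum(sum(row) for row in observed.values())
    row_totals = {func: sum(row) for func, row in observed.items()}
    col_totals = [
        sum(row[i] for row in observed.values()) for i in range(len(fold_classes))
    ]
    result = {}
    for i, fold_class in enumerate(fold_classes):
        chi2 = 0.0
        for func, row in observed.items():
            expected = col_totals[i] * row_totals[func] / grand_total
            chi2 += (row[i] - expected) ** 2 / expected
        result[fold_class] = chi2
    return result


chi2_by_fold = chi_squared_by_fold_class(OBSERVED, FOLD_CLASSES)
for fold_class in FOLD_CLASSES:
    print(f"{fold_class:6s} {chi2_by_fold[fold_class]:.1f}")
```

With these counts the values for the alpha/beta and small-protein classes come out close to the 32.5 and 27.8 that remain legible in the Table. Converting a statistic to a P-value would additionally need the chi-squared distribution, which is left out here to keep the sketch free of external libraries.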



Restricting the comparison to 
individual genomes 

Figure 3(a) applies to all of Swissprot. Figure 3(b)
and (c) shows the functional distribution of folds
taking into account the matches only in two
specific genomes, yeast and E. coli. Only a fraction
of each genome could be taken into consideration
for various reasons (156 proteins in yeast, 244 proteins
in E. coli), mostly due to the great number of
enzymes having multiple domains in both yeast
and E. coli. Chi-squared tests show that the fold
distribution in yeast does not differ significantly
from that in Swissprot and that the one in E. coli
differs only slightly (P < 0.23 and P < 0.02, respectively).
The main difference between Swissprot and
E. coli is the larger fraction of alpha/beta enzymatic
folds in the latter (34/93 versus 26/49). There
are also somewhat more non-enzymatic all-alpha
and small folds in Swissprot than in the two genomes.
This is principally due to the greater prevalence
of globins, myosins, cytochromes, toxins, and



The yeast genome viewed from different 
classification schemes 

In Figure 4 we focus on the yeast genome in 
more detail, trying to see the effect that different 
classification schemes have on our results. 
Although the total number of counts for our stat- 
istics decrease, in just using yeast relative to all of 
Swissprot, yeast provides a good reference frame 
to compare a number of classification schemes in 
as unbiased a fashion as possible. Also, yeast is 
one of the most comprehensively characterized
organisms, and there are a number of functional
classifications available exclusively for this organism.

In Figure 4(a) we cross-tabulate the structure-function
combinations in yeast using the
SCOP and EC systems as we have done for all of
Swissprot in Table 2B. The yeast distribution is
fairly similar to that of Swissprot with the only
major difference being somewhat more alpha/beta
transferases and fewer alpha/beta hydrolases than
expected. (A chi-squared test gives P < 0.05 for
the two distributions to differ. If either the transfer



bution turns out to be similar to that of Swissprot
(data not shown).

[illegible] (Orengo et al., [illegible]) instead of SCOP.
[illegible] this Figure we mapped the [illegible] classification of
yeast PDB match to its corresponding CATH
classification and then cross-tabulated the struc- 
ture-function combinations in the various classes. 
Essentially, this Figure shows the results reported
by Martin et al. (1998) just for yeast. 

In Figure 4(c) and (d), which show COGs versus
SCOP cross-tabulations, we achieve the opposite of
(b). We change the functional classification
scheme but keep SCOP for classifying structures. 
As was the case with the enzyme classification, but 
perhaps even more so, using COGs to classify 
function shows clearly that certain fold classes are 
associated with certain functions and vice versa. 
Most notably, whereas the functions associated 
with metabolism, which are mostly enzymes, are 
preferentially associated with the alpha/beta fold 
class, those associated with cellular processes (e.g. 
secretion) and information processing (e.g. tran- 
scription), show no such preference. They, in fact, 
show a marked preference for all-alpha structure. 
Small proteins are absent from most of the COGs 
classes, except one part of information processing 
and two in cellular processes. 

The COGs system classifies functions for those 
proteins that have clear orthologues in different 
species. Thus, conclusions based on using yeast 
COGs should be readily applicable to other gen- 
omes. This point is highlighted in Figure 4(d), 
which shows a COGs versus SCOP classification 
for only the 110 COGs that are conserved across all 
the analyzed genomes (eight) and all three king- 
doms. Thus, this sub-figure would appear exactly 
the same for E. coli, Methanococcus jannaschii or a
number of other genomes. It clearly shows how 
much more common the information processing 
proteins are among the most conserved and 
ancient proteins. Moreover, note how these most 
ancient proteins appear to have less of a preference 
for a particular structural class than the "more 
modern" metabolic ones. This suggests that large- 
scale duplication of alpha /beta folds for use in 
metabolism is what gave rise to stronger fold-func- 
tion association in Figure 3(c). 



Figure 3. Chart with breakdown among structure-function
classes in two genomes. Charts and
Tables showing the number of folds in each fold class
associated with only enzymatic (ENZ), only non-enzymatic
(nonENZ), and both enzymatic and non-enzymatic
functions (Both). The results are shown for (a) all
of Swissprot, (b) for just the yeast genome, and (c) for
just the E. coli genome. The results for individual
domains in a minimum set of SCOP domains also support
these tendencies (data not shown). The numbers in
(b) are not based on the PSI-blast protocol used for
Figure 4. Rather they are found just as "subsets" of the
overall Swissprot results to make them readily comparable
with the rest of the paper. Because of this the numbers
in this Figure will not match exactly those in
Figure 4, the difference having to do with the greater
number of fold-function combinations found by PSI-blast
as compared to WU-blast.
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Figure 4. Structure-function classes in the \ st genome analy/ed through a \anety of classification schemes I h is 
1 igure shows the distribution ot told function combinations in the yeast genome as analyzed bv a \anety of different 
structure and functional classifications Kach ot the Figures is a cross-tabulation ot one structural classification scheme 
(on the column heads) vrrsu> a functional classification (row heads), (a) SCOP jvrsi/s HN'ZYML; (b) CATH vcrau^ 
f-MZYMF; <c) SCOP ivrsr/s CCX.s; (d) SCOP versus Most Conversed COC.s; (e) SCOP i»rrs»s MII^ Functional Catalo- 
gue Pach of the grid boxes gives the number ot fold function combinations within a structure-function class This 
number is expressed as a percentage ot the total number of combinations in the diagram to make the graphs readily 
comparable The total number ot combinations in each of the sub-figures is (a) 141, (b) 77, (c) 1207, (d) 120, and fe) 
hh (a) and (e) are directlv comparable with the cross tabulation in Table 2b tor all of Swissprot In (d) and (e), we 
employ the COC.s scheme in exactly the same fashion as we did the f\\ZY\1h classification We form combinations 
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Figure 5. The most versatile folds. The functions 
associated with the 16 most versatile folds are shown. 
Values in the table denote the number of matches 
between a particular fold type in pdb Q 5d (designated by 
its fold number in SCOP 1.35) and an enzyme category 
(represented bv the first three components ot the 
respective EC numbers). Here and in the following 
Tables the same parameters were used tor matching as 
in Figure 2. The numbers in the top row indicate the 
number of functions a particular fold is associated with. 
The identifiers above the fold numbers are either PDB 
or SCOP identifiers of representative structures (the lat- 
ter onlv if the PDB entry contains more than one 
domain or chain). (See the legend to I able 3 tor the syn- 
tax of SCOP identifiers.) The first row in the table with 
the artificial 0.0.0 EC number shows the number of 
matches with non-enzvmatic functions Among the two 
all-alpha tolds in the table, cytochrome P450 (1.0n3) is 
exclusively enzymatic, associated with five different 
enzyme functions, all related to cytochrome P450. Onlv 
one alpha ■+- beta fold, ferredoxin (4 031), is present in 
the table, predominantly with matches with non-enzy- 
matic ferredoxins, but also with enzvmes in four differ- 
ent enzvme classes. In the multi-domain class, beta- 
lactamase/n-ala carboxypeptidase (3 003) has the most 
matches with penicillinase (EC number 3 5.2) and only 
one match with a non-enzyme, which also binds penicil- 
lin but has no enzymatic activity (Coquc ct al., 1993). 
The class of small domains is represented onlv with one 
fold, membrane-bound rubredoxin-like (7.035), and has 
matches onlv with enzvmes It is possible that some pro- 
teins classified as "non-enzymes'' mav indeed be 
enzvmes, missing the corresponding EC number. In this 
case, our analysis mav be potentially useful in pointing 
to which non-enzymes may actually be enzymes. 



Figure 4(e) shows another functional classifi-
cation scheme, the MIPS Yeast functional catalogue 
(Mewes et al., 1997). Unlike the COGs scheme, this 
has the advantage of being applicable to every 
yeast open reading frame (ORF). However, it has 
many more categories and about a third of the 
yeast ORFs are classified into multiple categories 
(sometimes five or more), making interpretation of 
the results a bit more ambiguous. 

The most versatile folds and the most 
versatile functions 

Returning to considerations of all of Swissprot, 
Figure 5 lists the 16 most versatile folds. The top 
five are the TIM-barrel, the alpha/beta hydrolase 
fold, the Rossmann fold, the P-loop containing 
NTP hydrolase fold, and the ferredoxin fold. Four 
of these are alpha/beta folds and one is alpha 
+ beta. All five have non-enzymatic functions as 
well as five to 15 enzymatic ones. The most versa-
tile folds include four all-beta and two all-alpha 
folds. 

Figure 6 lists the 18 functions that have the most 
different folds associated with them, each having 
at least three associated folds. The most versatile 
functions are those of glycosidases and carboxy- 
lases (3.2.1 and 4.2.1), which are associated with 
seven different fold types each, recruited from at 
least three different fold classes. The next two 
most versatile functions, the phosphoric monoe- 
ster hydrolases and the linear monoester hydro- 
lases (3.1.3 and 3.5.1), are associated with six 
different fold types each. Most of the versatile 
functions are associated with folds in completely 
different fold classes. This suggests that these 
enzymes developed independently, providing 
many examples of convergent evolution. In con- 
trast, only three functions, all oxidoreductases, 
are associated with folds in a single class (last 
three rows in Figure 6). These folds are all 
alpha/beta, namely the TIM-barrel, Rossmann, 
and flavodoxin folds. 

Specific functional convergences involving 
different folds 

Even on the level of specificity of four-com-
ponent EC numbers, several enzymatic functions 
are performed by unrelated structures. Figure 1 
shows a dramatic example, two different carbonic 
anhydrases with the same EC number 4.2.1.1, but 
with clearly different structures (Kisker et al., 
1996). Table 3 shows further examples in a more 
systematic fashion. Most of these occur in differ-
ent evolutionary lineages. For instance, the all-
alpha vanadium chloroperoxidase occurs only in 
fungi, while the alpha/beta non-heme chloroper-
oxidase occurs only in prokaryotes. Another 
example is beta-glucanase. It has as many as 
three different structural representations, from 
three different fold classes. While it has an all-
beta structure in Bacillus subtilis, it has an all-



[Figure: fold-function table, columns grouped by fold class (A/B, A+B, MULTI, SML). Surviving legend text: a hash (#) in any cell indicates that its value is greater than 99.] 



alpha variant in Bacillus circulans, and an alpha/
beta structure in tobacco. 

Specific functional divergences on same fold 

Quite a number of SCOP domains each have 
sequence similarity with Swissprot proteins of 
different function. We separated these into cases 
in which the structural domain has similarity to 
proteins with different enzymatic functions only 
and those in which a domain shows homology to 
both enzymes and non-enzymes (Table 4A and B, 
respectively). Table 4A includes the well-known 
lactalbumin-lysozyme C similarity and the well-
documented case of homology between an eye-
lens structural protein and an enzyme (crystallin 
and glutathione S-transferase; Cooper et al., 
1993; Qasba & Kumar, 1997). It includes several 



other notable divergences, such as the one 
between lysophospholipidase and galectin, and 
the one between an elastase and an antimicrobial 
protein (Morgan et al., 1991). Remarkably, of the 
seven domains in this Table, three belong to the 
all-beta class. 

"Multifunctionality" versus e-value 

Figure 7 shows how the number of "multifunc- 
tional" domains, i.e. domains with sequence simi- 
larity to proteins with different functions, varies as 
the function of the stringency of the match score 
threshold. We used a minimal version of SCOP in 
which the structures in PDB were clustered into 
990 representative domains (see the legend to 
Figure 7). The Figure shows how the percentage of 
domains that have sequence similarity to proteins 

[Figure 7 plot: relative number of domains with multiple functions as a function of the e-value threshold; x-axis: -log(e-value), 0 to 70.] 

Figure 7. Multi-functionality versus e-value threshold. 
The graph shows how the percentage number of multi-
functional enzymatic domains varies as the function of 
the e-value threshold. A multi-functional domain occurs 
when a particular domain in SCOP matches domains in 
Swissprot with different enzymatic function. For these 
calculations, we had to use a more minimal version of 
SCOP than the pdb95d dataset referred to in the 
methods to prevent double matches, i.e. two SCOP 
domains matching a single Swissprot domain. The con-
struction of this minimal SCOP was described pre-
viously (Gerstein, 1998a). Basically, all the domains in 
SCOP were clustered via a multi-linkage approach into 
990 representative domains, such that no two domains 
matched each other with a FASTA e-value better than 
0.01. 



with different functions (in terms of three-com-
ponent EC numbers) varies with sequence simi-
larity. This decreases approximately monotonically 
as a function of the exponent of the e-value 
threshold. Interestingly, there is a breaking point 
around log(e-value) = -5, as the sharply decreas-
ing number of functions slows down and the 
matches reach the level of biological significance. 
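
The quantity plotted in Figure 7 can be sketched in a few lines of Python. This is only an illustration of the idea, not the protocol used for the Figure: the input format (tuples of domain identifier, three-component EC category, and e-value) and the function name are hypothetical, and the choice to normalize by the number of domains that have at least one match at the given threshold is an assumption.

```python
from collections import defaultdict


def multifunctional_fraction(matches, threshold):
    """Fraction of matched domains that hit more than one EC category.

    matches   -- iterable of (domain_id, ec_category, e_value) tuples
                 (hypothetical input format)
    threshold -- keep only matches with e_value <= threshold
    """
    functions = defaultdict(set)
    for domain_id, ec_category, e_value in matches:
        if e_value <= threshold:
            functions[domain_id].add(ec_category)
    if not functions:
        # No domain has a match at this stringency.
        return 0.0
    # A domain is "multifunctional" if its matches span several categories.
    multi = sum(1 for cats in functions.values() if len(cats) > 1)
    return multi / len(functions)
```

Evaluating this function over a series of increasingly strict thresholds gives a curve of the kind shown in the Figure.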

Our graph can be loosely compared with the 
classic graph by Chothia & Lesk (1986) showing 
the relation of similarity in structure to that in 
sequence. It roughly shows the chance of func-
tional similarity (or more precisely the chance of 
functional difference) with a given level of 
sequence similarity between an enzyme and a pro-
tein of unknown function. For example, with an 
e-value of 10 [exponent illegible] there is only an 
[illegible]% chance that 
an unknown protein homologous to a certain 
enzyme has in fact a different function. Moreover, 
our graph is in excellent agreement with the find-
ings by Russell et al. (1998) who also found that 
the proportion of homologues with different func-
tions is around 10%. This shows that there is a 
low chance that a single-domain protein, highly 
homologous to a known enzyme has a different 



ing functionally characterized enzymes in Swis- 
sprot with structurally characterized domains in 
SCOP. It is a timely subject, as the number of 
three-dimensional protein structures is increasing 
rapidly and the recent completion of several 
microbial genomes highlights the need for func- 
tional characterization of the gene products and 
identification of enzymes participating in metabolic 
pathways (Koonin et al., 1998). 

We tried to be as objective and as unbiased as 
possible, taking only enzymes with a single 
assigned function and only single-domain matches. 
We ignored Swissprot proteins with dubious or 
unknown function, or with incomplete sequence. 
Given these criteria, several tendencies are clear. 
The alpha/beta folds tend to be enzymes. The all-
alpha folds tend to be non-enzymes and the all-
beta and alpha + beta folds tend to have a more 
even distribution between enzymes and non-
enzymes. 

Our analysis of proteins from yeast and E. coli 
has shown that the functional distribution of 
folds does not differ greatly from the whole of 
Swissprot. E. coli, however, appears to have 
somewhat more alpha/beta enzymes and fewer 
non-enzymes. 

Functional assignment complexities 

We identified four specific complexities in our 
functional assignment worth mentioning. 

Firstly, there is not always a one-to-one relation-
ship between gene, protein and reaction (Riley, 
1998). An enzyme can have two functions, or two 
polypeptides from two different genes can oligo-
merize to perform a single function. It might be 
that some of the fold-function combinations in 
Figure 2 occur together in multi-domain proteins 
(which otherwise were not the subject of this sur-
vey). An exhaustive screening revealed that only 
four pairs of folds in Figure 2 were present concur-
rently in multi-domain proteins. Each of these 
reduced by one the number of independent fold-
function combinations. (The four pairs were as fol-
lows, with one representative Swissprot protein in 
each category, EC numbers in parentheses, and 
then SCOP fold numbers: PTAA_ECOLI (2.7.1) has 
4.049 and 2.055 folds, TRP_COPCI (4.2.1) has 3.057 
and 4.005 folds, URE1_HELFE (3.5.1) has 4.005 and 
2.056 folds, while XYNA_RUMFL (3.2.1) has 2.018 
and 3.001 folds.) 

Secondly, the functions associated with similar 
structures often turn out to be analogous, even if 
they show significant difference in their EC num-
bers. For example, acetyl-CoA carboxylase and 
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reactions, not underlying biochemical mechanisms. 
An enzyme classification system based explicitly 
on reaction mechanism (e.g. "involves pyridoxal 
phosphate" or "involves Ser as a nucleophile") 
might also prove interesting to compare with pro- 
tein structure. Alternatively, one based on path- 
ways might be worthwhile since, as pointed out by 
Martin et al. (1998), "it may be that more signifi- 
cant relationships occur within pathways, where 
the substrate is successively transferred from 
enzyme to enzyme along the pathway, requiring 
similar binding sites at each stage". 

Finally, in all of Swissprot the majority of the 
101 folds with only non-enzymatic functions prob- 
ably have several functions, but we were not able 
to consider them separately here, lacking a general 
protein function classification system for non- 
enzymes. Such a system is not easy to derive. For 
instance, if we took only the first three words of all 
the description lines in Swissprot, we would end 
up with about 10,000 different protein functions 
(besides enzymes). An approximate solution to this 
problem is offered by a recent work that has classi- 
fied 81 % of Swissprot into one of three broad cat- 
egories in an automated fashion (Tamames et al., 
1997). However, one way we did tackle this pro- 
blem was by focussing on the yeast genome for 
which there are a number of overall functional 
classification systems. This work showed that the 
preferred association of folds with certain functions 
occurs for non-enzymes as well as enzymes. Fur- 
thermore, the results for the highly conserved 
COGs would be expected to be exactly the same in 
other genomes. 

Biases 

Our results are undoubtedly affected to some 
degree by the biases inherent in the databanks, e.g. 
towards mammalian, medically relevant proteins 
and towards proteins that easily crystallize. Such 
biases probably result in the higher representation 
of enzymes in the structural databases, in the PDB 
and therefore in SCOP. This might be the cause of 
the higher occurrence of alpha/beta proteins in our 
tables and the higher density of matches in this 
class. 

One interesting question related to biases is 
whether looking only at individual genomes 
instead of the whole database will give different 
results. Our results for yeast suggest that it is not 
necessarily the case. 

Comparison with Martin et al. (1998) 

Martin et al. (1998) performed a similar analysis 
to the one described here. One of the conclusions 
of their careful study was that there was no 
relationship between the top-level CATH classifi- 
cation and the top-level EC class. This seems to be 
at odds with our results. However, we have found 
the conclusions to be consistent. There are a num- 
ber of reasons for this. 



Firstly, Martin et al. (1998) tabulate statistics on 
only the proteins in the PDB. They found a clear 
alpha/beta preference for proteins in the oxido- 
reductase, transferase, and hydrolase categories 
(EC 1-3), but for the lyase, isomerase, and ligase 
categories (EC 4-6) they observe different ten- 
dencies. However, they did not have sufficient 
counts to establish statistical significance for this 
latter finding. (This is basically what we observe in 
Figure 4(b).) Because in our analysis we use all of 
Swissprot and we tabulate our statistics a little dif- 
ferently (in terms of combinations), we get more 
"counts" than Martin et al. (1998). Thus, we are 
able to argue that the different distribution of fold- 
function combinations observed for lyases, iso- 
merases, and ligases are significant. This is borne 
out by the chi-squared statistics at the end of 
Table 2. 

Secondly, the "no-relationship" conclusion of 
Martin et al. (1998) applies only to comparisons 
between the different enzyme classes. However, we 
find our largest differences when comparing 
non-enzymes to enzymes and also comparing 
between the various types of non-enzymes. 

Finally, the CATH classification that Martin et al. 
use has only three classes in its top-most level. In 
contrast, SCOP has six top classes (Table 1). While 
this larger number of categories does tend to 
degrade our statistics somewhat, it also highlights 
some differences that cannot be observed in terms 
of the CATH classes alone, e.g. we find clear differ-
ences between alpha + beta and alpha/beta pro-
teins and also between small proteins and all 
others. 

Apparently high occurrence of 
convergent evolution 

Note that the table in Figure 2 is not square: it 
has more folds than functions. This shape leads to 
a number of interesting conclusions. The 331 fold- 
function combinations we observe for 229 folds 
and 92 functions imply that there are 1.4 functions 
per fold and 3.6 folds per function. However, these 
numbers are somewhat skewed by the large num- 
ber of folds (101) associated only with the single 
non-enzymatic function. If we exclude these, we 
get 128 "enzyme-related" folds, which are, in turn, 
associated with 230 (=331 - 101) different fold- 
function combinations. This implies that for the 
enzyme-related folds there are on average 1.8 func- 
tions per fold and 2.5 folds per function (230/128 
and 230/92). The larger number of folds per func- 
tion than functions per fold seems to suggest that 
nature tends to reinvent an enzymatic function (i.e. 
convergent evolution) more often than modify an 
already existing one (i.e. functional divergence). 
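
The arithmetic behind these averages can be retraced from the counts quoted in the text. The short Python sketch below does so; the variable names are ours and purely illustrative.

```python
# Counts quoted in the text above.
combinations = 331         # observed fold-function combinations
folds = 229                # folds with at least one assigned function
functions = 92             # distinct functions
non_enzymatic_only = 101   # folds tied only to the single non-enzymatic function

# Averages over all folds and all functions.
functions_per_fold = combinations / folds
folds_per_function = combinations / functions

# Enzyme-related subset: each non-enzymatic-only fold contributes exactly
# one combination, so both totals drop by the same amount.
enzyme_folds = folds - non_enzymatic_only                # 128
enzyme_combinations = combinations - non_enzymatic_only  # 230
enzyme_functions_per_fold = enzyme_combinations / enzyme_folds
enzyme_folds_per_function = enzyme_combinations / functions

print(f"all folds:      {functions_per_fold:.1f} functions/fold, "
      f"{folds_per_function:.1f} folds/function")
print(f"enzyme-related: {enzyme_functions_per_fold:.1f} functions/fold, "
      f"{enzyme_folds_per_function:.1f} folds/function")
```

Rounded to one decimal place, the enzyme-related ratios come out as 1.8 functions per fold and 2.5 folds per function, matching the values given above.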

How can we explain this? Firstly, 1.8 is a lower 
estimation for the number of functions per fold as 
the non-enzymatic functions were bundled into 
one group here. Secondly, there are several 
examples of functional divergence for a fold within 
one three-component enzyme category that are not 
reflected in our Tables. For instance, the 1.1.1 cat-
egory has 248 different enzymes, which all share 
the same fold. Thirdly, the results in this paper 
were derived from databases comprised of data 
from several organisms. It is quite possible that 
within one organism, functional divergence is 
more prevalent than convergent evolution. 

Superfolds and superfunctions 

Are functions more diverse for the more com-
mon folds? To some degree this brings up a 
"chicken-and-egg" issue. Do folds have more func-
tions because they occur more often or is it the 
other way around? The commonness of a fold is 
often quantified by the number of non-homologous 
sequence families accommodated by the fold, and 
folds accommodating many families of diverse 
sequences have been dubbed "superfolds" 
(Orengo et al., 1993). We find that there seems to 
be a loose connection between the number of 
diverse sequence families associated with a par-
ticular fold (in SCOP) and the functional diversity 
of that fold. For instance, the top superfold is the 
TIM-barrel; it also has the most functions associ-
ated with it (15 different enzymatic functions as 
shown in Figure 4). On the other hand, there are 
exceptions: the alpha/beta hydrolases and the 
Rossmann fold are both associated with 22 
sequence families in SCOP, but while the former 
has eight different enzymatic functions, the latter 
has only three. 

Finally, while there is a high incidence of par- 
ticular functions with many folds ("superfunc- 
tions"), as well as folds with many functions, the 
distribution of superfunctions appears to be more 
uniform and less concentrated on a few exception- 
ally versatile individuals than is the case for folds. 
That is, comparing Figures 3 and 4 one can see 
that the top nine most versatile functions are 
associated with five to seven folds while the top 
nine most versatile folds carry out from six to as 
many as 16 functions. This last value is for the 
TIM-barrel and underscores the uniqueness of this 
fold as a generic scaffold (see Figure 1 for an illus-
tration of this fold). 

Why folds are associated with functions: 
chemistry versus history 

Why is a certain fold chosen to carry out a par-
ticular function? It is, of course, not possible to 
answer this question definitively at present. How-
ever, there are two broad themes that emerge from 
our analysis. The first is favorable chemistry. Per 



more chemically suitable. This could be the situ- 
ation for the ribosomal proteins (and is borne out 
by the results of Figure 4(d)). 

Materials and Methods 

Sequence matching to Swissprot 

All the protein sequences in Swissprot 35 were com- 
pared with all the protein domain sequences in SCOP 
1.35 by standard database search programs (WU-BLAST, 
Altschul et al., 1990). The following five criteria were 
used in the searches: (1) At least three of the four com-
ponents of the EC number are assigned in the DE line of 
the Swissprot entries. (2) Fragments in Swissprot were 
excluded (this affected about 10% of the entries). (3) For 
WU-BLAST searches an e-value threshold of 0.0001 was 
used, unless stated otherwise. (4) Only "monoenzymes", 
i.e. proteins with only one enzymatic function, were con- 
sidered. This excluded less than 0.5% of the Swissprot 
enzymes. (5) Only single-domain matches with Swis- 
sprot proteins were taken into consideration. This means 
those proteins that had a match with a SCOP domain 
covering most of the Swissprot protein. Specifically, we 
required that less than 100 amino acid residues be left 
uncovered in the Swissprot entry by a match. We are 
aware that this is only an approximation, as there are 
domains with less than 100 amino acid residues; how- 
ever, it is considerably less than the average length of a 
SCOP domain (163 residues) and seems to be a reason- 
able threshold in an automated approach. 
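
The five criteria can be read as a single filter applied to each candidate enzyme match. The Python sketch below is an illustration under stated assumptions, not the code used in this study: the `Match` record and all of its field names are hypothetical, and treating a match exactly at the threshold as acceptable is our assumption.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One Swissprot-versus-SCOP hit; every field name here is hypothetical."""
    ec_number: str               # e.g. "4.2.1.1" or "3.2.1.-"
    is_fragment: bool            # Swissprot entry is flagged as a fragment
    e_value: float               # e-value of the match
    n_enzymatic_functions: int   # number of enzymatic functions of the protein
    protein_length: int          # residues in the Swissprot entry
    covered_residues: int        # residues covered by the SCOP domain match


def ec_components_assigned(ec_number: str) -> int:
    """Count the leading EC components that are actual numbers."""
    count = 0
    for part in ec_number.split("."):
        if not part.isdigit():
            break
        count += 1
    return count


def passes_criteria(m: Match, e_value_threshold: float = 1e-4,
                    max_uncovered: int = 100) -> bool:
    """Apply criteria (1) to (5) from the text to one match."""
    uncovered = m.protein_length - m.covered_residues
    return (
        ec_components_assigned(m.ec_number) >= 3    # (1) three EC components
        and not m.is_fragment                       # (2) no fragments
        and m.e_value <= e_value_threshold          # (3) e-value threshold
        and m.n_enzymatic_functions == 1            # (4) monoenzymes only
        and uncovered < max_uncovered               # (5) single-domain match
    )
```

Keeping the thresholds as keyword arguments makes it easy to repeat the filtering with a different cutoff, for example the 0.01 used in the FASTA searches.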

All the searches were repeated using FASTA with an 
e-value threshold of 0.01 (Pearson, 1998; Pearson & 
Lipman, 1988). The results obtained by the two different 
comparison programs were in agreement with each 
other. That is, the FASTA searches did not result in any 
new combinations of folds and enzymatic functions (a 
new dot in Figure 1), and therefore are not shown. 

Sequence matching to the yeast genome 

To get as great a coverage of the yeast genome as 
possible, we did a sequence comparison for just Figure 4 
using an altered protocol. We first ran the PDB against 
the yeast genome using FASTA and kept all matches 
with a better than 0.01 e-value (Pearson, 1998; Pearson & 
Lipman, 1988). Then, to increase our number of matches 
further we used the PSI-blast program (Altschul et al., 
1997). This program is somewhat more complex to run 
than FASTA, involving embedding the yeast genome 
in NRDB and running PDB query sequences against it 
in an iterative fashion, adding the matches found at 
each round to a growing profile. We used the PSI-blast 
parameters adapted from Teichmann et al. (1998): an 
e-value threshold of 0.0005 to include matches in the 
profile and iteration of up to 30 times or to conver-
gence. We did not continuously parse the output and 
accepted matches at the final iteration that had e-value 
scores better than 0.0001. The number of iterations to 

through running the SEG program with standard par-
ameters (Wootton & Federhen, 1996). 



How the structural classifications were used: SCOP 
and CATH 

SCOP hierarchically clusters all the domains in the 
PDB database, assigning a five-component number to 
each domain (Murzin et al., 1995). The first component in 
the SCOP numbers denotes the structural class to which 
the domain in question belongs. The second component 
of the SCOP numbers designates the fold type of the 
domain. There are altogether 361 different fold types in 
SCOP 1.35. The six SCOP classes used in this survey are 
listed in Table 1B. 

In this study, a 95% non-redundant subset of SCOP 
was used, i.e. all pairs of domains had less than 95% 
sequence homology. This set is denoted pdb95d and is 
available from the SCOP website (scop.mrc-lmb.cam.ac.uk). 
We used version 1.35, which had 2314 protein 
domains. (The yeast analysis used a more recent version 
of SCOP, 1.38, which had 3206 domains.) 

The CATH classification classifies structures in analo-
gous fashion to SCOP (Orengo et al., 1997). However, the 
exact structure of the classification is not the same, 
with an additional architecture level inserted between 
the top-level class and the fold-level. In our use of 
the classification, we created a limited mapping table 
that associated each SCOP domain in pdb95d with 
its corresponding classification in CATH 1.4. This 
was not always possible to do unambiguously. As a 
result, we left out the ambiguous matches from the 
statistics. 



How the functional classifications were used: 
ENZYME, COGS, and MIPS 

The EC numbers of enzymes are composed of four 
components (Barrett, 1997). (1) The first component 
shows to which of the six main divisions the enzyme 
belongs. (2) The second figure indicates the subclass 
(referring to the donor in oxidoreductases or the group 
transferred in transferases, or the affected bond in hydro- 
lases, lyases or ligases). (3) The third figure indicates the 
sub-subclass (e.g. indicating the type of acceptor in 
oxidoreductases). (4) The fourth figure gives the serial 
number of the enzyme in its sub-subclass. The six main 
divisions are listed in Table 1A. 
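
As an illustration of how a four-component EC number reduces to the three-component category used throughout this survey, consider the following Python sketch. The function names are hypothetical; the six division names are those mentioned elsewhere in the text (EC 1-3: oxidoreductase, transferase, hydrolase; EC 4-6: lyase, isomerase, ligase).

```python
# The six main EC divisions, keyed by the first component.
EC_DIVISIONS = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
}


def ec_category(ec_number: str, depth: int = 3) -> str:
    """Return the first `depth` components of an EC number, e.g. '4.2.1'."""
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four components, got {ec_number!r}")
    head = parts[:depth]
    if not all(p.isdigit() for p in head):
        raise ValueError(f"fewer than {depth} components assigned in {ec_number!r}")
    return ".".join(head)


def main_division(ec_number: str) -> str:
    """Name of the main division given by the first EC component."""
    return EC_DIVISIONS[int(ec_category(ec_number, depth=1))]
```

For the carbonic anhydrases discussed earlier (EC 4.2.1.1), `ec_category` gives the category 4.2.1 and `main_division` gives lyase.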

In the analysis of all of Swissprot, when we counted 
the number of non-enzymatic matches, all the proteins 
called 'HYPOTHETICAL' and all the proteins having an 
'-ase' word ending but lacking an EC number in their 
description were excluded, because of their functional 
ambiguity. For relating the sequence matches of the 
yeast genome to the EC system, we used essentially the 
same criteria as we did for all of Swissprot (see above): 
single-domain, monoenzyme matches with at least a 
three-component EC number. 

The COGs and especially the MIPS classifications are 
a bit more complex than the EC system in that they 
include non-enzymes as well as enzymes (Tatusov et al., 
1997; Koonin et al., 1998; Mewes et al., 1997). They often 
associate multiple functions or roles to a given yeast 
ORF. This happens for more than a third of the yeast 
ORFs with MIPS. In this case, if we could clearly show a 
PDB match was associated with a single functional 
domain we made only that pairing. Otherwise we associ-
ated all the functions assigned to a given PDB match to 
its respective fold. 

Availability of results over the internet 

A number of detailed tables relevant to our study 
will be made available over the Internet at http:// 
bioinfo.mbb.yale.edu/genome/foldfunc, in particular, a 
"clickable" version of Figure 1 and large data files giving 
all the fold assignment and fold-function combinations 
for Swissprot and yeast. 
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Annotation Transfer for Genomics: Measuring 
Functional Divergence in Multi-Domain Proteins 

Hedi Hegyi and Mark Gerstein 1 

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA 

Annotation transfer is a principal process in genome annotation. It involves "transferring" structural and 
functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from 
experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is 
important that this process be robust and statistically well-characterized, especially with regard to how it 
depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in 
single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, 
present more complex issues in functional conservation. Here we present a large-scale survey of annotation 
transfer in these proteins, using scop superfamilies to define domain folds and a thesaurus based on 
SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have 
significantly less functional conservation than single-domain ones, except when they share the exact same 
combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be 
accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In 
contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the 
other hand, if two multi-domain proteins contain the same combination of two structural superfamilies the 
probability of their sharing the same function increases to 80%. In the case of complete coverage along the full 
length of both proteins, this value increases further to > 90%. Moreover, we found that only 70 of the current 
total of 455 structural superfamilies are found in both single and multi-domain proteins and only 14 of these 
were associated with the same function in both categories of proteins. We also investigated the degree to which 
function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence 
similarity between them, finding that functional divergence at a given amount of sequence similarity is always 
about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in 
comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further 
information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func. 



The ultimate goal of the genome projects is to determine the 
structure and function of all the newly identified gene prod- 
ucts. Fundamentally, this will be carried out via annotation 
transfer, transferring the structural and functional annotation 
from an experimentally characterized protein (as in a model 
organism such as Escherichia coli) to a predicted protein in a 
newly sequenced genome that shares similarity in sequence. 
The degree of annotation transferred will depend on the de-
gree of sequence similarity. This process is shown schemati-
cally in Figure 1. In this paper, we aim to address this major 
question in bioinformatics, specifically focusing on multi-
domain proteins, as they make up the bulk of the proteome in 
eukaryotic organisms (Gerstein 1998). 

Our work is a direct outgrowth of two previous analyses 
of ours that concentrated on single-domain proteins. In an 
earlier paper, we found that the different structural classes of 
the scop classification system have different propensities to 
carry out certain types of function (Hegyi and Gerstein 1999). 
In particular, while the alpha/beta folds were disproportion-
ately associated with enzymes and all-alpha and small folds 
with non-enzymes, the alpha + beta structures had an equal 
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Wilson et al. (2000) compared a large number of protein do- 
mains to one another in a pair-wise fashion with respect to 
similarities in sequence, structure, and function. Using a hy-
brid functional classification scheme merging the ENZYME 
and FlyBase systems (Gelbart et al. 1997; Bairoch 2000), they 
found that precise function is not conserved below 30-40% 
identity, although the broad functional class is usually pre-
served for sequence identities as low as 20-25%, given that 
the sequences have the same fold. Their survey also reinforced 
the previously established general exponential relationship 
between structural and sequence similarity (Chothia and Lesk 
1986). 

Other Work on Establishing Relationships between 
Sequence, Structure, and Function 

Several other groups have studied the relationship between 
sequence, structure, and function in detail, attempting to determine
the extent to which functional transference between
matching proteins is feasible (Shah and Hunter 1997; Martin
et al. 1998; Thornton et al. 1999, 2000; Zhang et al. 1999;
Shapiro and Harris 2000; Todd et al. 2001). Orengo et al
Figure 1 Schematic illustrating annotation transfer. This figure illustrates the process of annotation transfer for a group of hypothetical TIM barrel
proteins. The leftmost panel represents sequence comparisons between idealized barrel domains from a number of organisms. The next panel 
shows analogous results for structural comparison, and the panel after that, functional comparison. The rightmost panel represents sequence 
comparisons between idealized multi-domain proteins that match over a single domain, the subject of much of this paper. 



Pawlowski et al. (2000) studied the relationship between se- 
quence and functional similarity in the twilight zone of 10%- 
15% sequence similarity and found a clear correlation between
the two, with functional similarity based on the EC
classification of enzymes.

Russell et al. (1997) analyzed binding sites in proteins 
with similar 3D structures and estimated that 90% of new
remote homologs have common binding sites and similar
functions. Eisenstein et al. (2000) evaluated the first results 
from the structural genomics projects and found that in many 
instances the protein structure itself offers an important clue 
to its biological function. Stawiski et al. (2000) found that 
function could be predicted rather successfully for just the 
proteases. Devos and Valencia (2000) presented a critical view 
of function transference between similar sequences, high- 
lighting the limitations of this process due to errors in data- 
bases and the inherent complexity of the relationship be- 
tween protein sequence-structure and function that does not 
allow "simplistic interpretations." They also found that bind- 
ing sites are the least conserved features between related pro- 
teins while the catalytic activity of enzymes is the most con- 
served one. 

Multi-Domain Proteins with Divergent Functions; 
How Common? 

Most of these previous investigations focused on single- 
domain proteins or did not distinguish between single- and 
multi-domain ones. It is not clear how the multi-domain pro- 
teins with various functions behave with respect to functional 
conservation; namely, whether they are more or less conserved
than their single-domain counterparts. In particular, as
shown in Figure 1, if one multi-domain protein shares a single
domain fold with another one, it is not clear the degree to
which the functional conservation of these proteins is con-



of the SH3-domain (scop superfamily identifier 2.24.2) and 
the P-loop containing NTP hydrolase (3.29.1). While in 
higher organisms this combination is associated with presyn- 
aptic and tumor suppressor functions (SWISS-PROT names 
SP02_HUMAN and DLG1_DROME, respectively), in the lower
Dictyostelium it was found in myosin (MYSP_DICDI). Another
example is the combination of the FAD/NAD(P)-binding
superfamily and FAD-linked reductases C-terminal
superfamily (3.4.1 and 4.12.1 superfamilies, respectively). In 
one group of proteins they appear in enzymes of the oxido- 
reductase group (e.g. OXDA_CAEEL or PHHY_PSEAE), while 
in another they are found in a dissociation inhibitor (e.g. 
GDIA_HUMAN). It should be noted that the proteins are not 
covered completely by the structural matches, so it is quite 
possible that the rest of them contain totally different do- 
mains that are responsible for the dramatically different func- 
tions. However, do these two examples show a rather rare or 
a more frequent phenomenon? How often do multi-domain 
proteins, sharing the same structural domain composition, 
differ in their functions? 

In this paper, we attempt to provide a comprehensive 
answer to this question. This is particularly timely given that 
most of the unknown proteins in eukaryotic genomes are
multi-domain. We use the same approach as in our previous
analyses, comparing the sequences of the structural domains
in scop to those of SWISS-PROT using BLASTP. We focus on
the functional divergence of single and multi-domain pro- 
teins, extending previous investigations of single-domain 
proteins. Also, in comparison to previous work, we focus 
more on non-enzymatic functions and scop structural super- 
families, instead of folds. 

RESULTS 
2000) with e = 10^-4. We removed the hypothetical and fragment
proteins. This resulted in two sets of proteins.

Single-Domain 

Of the single-domain matches, only those that were almost 
completely covered with a match to a single structural do- 
main were selected. (The maximum number of uncovered 
residues was set at 70 with an additional condition that a
maximum of 40 residues on the N-terminal end and 30 resi- 
dues on the C-terminus were allowed to be uncovered.) These 
criteria resulted in 1818 single-domain proteins being selected 
from SWISS-PROT. 
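
The coverage rule for single-domain matches can be sketched in a few lines. This is only an illustration of the thresholds stated above, assuming 1-based inclusive match coordinates; the function name `is_single_domain_match` and its parameters are hypothetical and are not taken from the original analysis.

```python
def is_single_domain_match(protein_length, match_start, match_end,
                           max_total=70, max_n_term=40, max_c_term=30):
    """Return True if one structural match covers almost the whole protein.

    Coordinates are assumed to be 1-based and inclusive (an assumption
    made for this sketch, not stated in the text).
    """
    n_uncovered = match_start - 1             # residues before the match
    c_uncovered = protein_length - match_end  # residues after the match
    return (n_uncovered <= max_n_term
            and c_uncovered <= max_c_term
            and n_uncovered + c_uncovered <= max_total)
```

With the default limits the two per-terminus bounds already imply the total of 70, so the last condition only matters if the limits are changed independently.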

Multi-Domain 

We selected 4763 multi-domain proteins from SWISS-PROT. 
All of these matched (in different locations) at least two do- 
mains of known structure belonging to different scop super- 
families (see schematic in Figure 1). We also selected a subset 
of these proteins that have almost their entire length covered 
by matches with structural domains (allowing again a maxi- 
mum of 70 uncovered residues). This selection resulted in 
2829 proteins being selected from SWISS-PROT. (In all cases, 
duplicate matches were removed, i.e., a protein at a certain 
location matches only one structural domain.) 
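
A minimal sketch of the multi-domain selection follows. The text states only that each location keeps a single structural match and that at least two different superfamilies must be matched; resolving overlaps by preferring the lower e-value is an assumption of this sketch, and the names `select_domain_matches` and `is_multi_domain` are hypothetical.

```python
def select_domain_matches(matches):
    """Keep at most one match per location.

    `matches` is a list of (start, end, superfamily, evalue) tuples.
    Overlaps are resolved by preferring the lower e-value (an assumption).
    """
    kept = []
    for match in sorted(matches, key=lambda m: m[3]):
        start, end = match[0], match[1]
        # Accept the match only if it does not overlap anything already kept.
        if all(end < k[0] or start > k[1] for k in kept):
            kept.append(match)
    return sorted(kept)


def is_multi_domain(matches):
    """True if the kept matches belong to at least two different superfamilies."""
    superfamilies = {m[2] for m in select_domain_matches(matches)}
    return len(superfamilies) >= 2
```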

We set out to compare these two sets of proteins for 
functional divergence. As previously, we divided functions 
into enzyme and non-enzyme (Hegyi and Gerstein 1999). En- 
zymatic functions were classified by the EC system (Bairoch 
2000). Comparisons of enzymatic functions were treated the 
same way as in our earlier analyses, that is, if they differ in the 
first three components of their respective EC numbers, they 
were considered different. This implied that our analysis dealt 
with a total of 112 enzymatic functions. Non-enzymatic functions
were classified into 508 different categories based on a
simple thesaurus we assembled of synonymous keywords 
drawn from SWISS-PROT description lines. In addition, we 
created 49 categories for functions that have an enzymatic 
component but which are not part of the EC system. This gave 
us a total of 669 functions (112 + 508 + 49). (The list of all the 
functional categories is described further in Table 2 below, 
and also can be found on the Web at http://bioinfo.mbb.yale.edu/partslist/func
or http://partslist.org/func.)
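
The enzymatic comparison rule (two enzymes count as having the same function when the first three components of their EC numbers agree) can be written down directly. The sketch below is illustrative; the helper names are hypothetical, and treating an unassigned component such as "-" as an ordinary string is an assumption.

```python
def ec_class(ec_number, depth=3):
    """Return the first `depth` components of an EC number as a tuple of strings."""
    parts = ec_number.strip().split(".")
    return tuple(parts[:depth])


def same_enzymatic_function(ec_a, ec_b):
    """Same function if the first three EC components agree."""
    return ec_class(ec_a) == ec_class(ec_b)


# Category counts as stated in the text.
N_ENZYMATIC = 112        # distinct three-component EC classes
N_NON_ENZYMATIC = 508    # keyword-based non-enzymatic categories
N_ENZYMATIC_NO_EC = 49   # enzymatic component but outside the EC system
TOTAL_FUNCTIONS = N_ENZYMATIC + N_NON_ENZYMATIC + N_ENZYMATIC_NO_EC
```

Under this rule an endoglucanase (3.2.1.4) and a glucoamylase (3.2.1.3) fall into the same functional class, E3.2.1.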

Overall Distribution of the Matches 

Figure 2 shows the most commonly observed multi-domain 
combinations in a set of recently sequenced genomes. The
occurrences of further combinations are available from the
Web site. Clearly, the distribution is very skewed, with certain
combinations, such as 3.29-2.32, and 2. 29-4. til tending to
predominate. 

Figure 3 shows the overall distribution of the single- 
domain and multi-domain matches in the different structural 
classes. The distribution of matches between enzymes and 
non-enzymes in multi-domain proteins largely agrees with 
that in the single-domain proteins. The multi-domain 
matches follow the overall tendency of the alpha/beta folds to 
be associated with enzymes to a larger extent and the all- 
alpha and small folds with non-enzymes. However, the values 
for the multi-domain matches are generally less extreme than



[Figure 2 image: grid titled FOLD PAIRS giving the number of occurrences of each fold pair (rows) in each genome (columns); the cell values are not recoverable from the extracted text.]

Figure 2 Distribution of multi-domain combinations amongst the
genomes. The figure shows the occurrence of multi-domain fold com- 
binations in a number of genomes, indicating its great variability. 
Each row indicates a particular combination of scop fold pairs (using 
scop 1.39), where a fold pair is defined as two distinct folds occurring 
in tandem in a protein. Each column represents a different genome, 
using the four-letter codes in the PartsList system (Qian et al. 2001): 
Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Bbur, Borrelia
burgdorferi; Bsub, Bacillus subtilis; Cele, Caenorhabditis elegans; Cpne,
Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Ecol, Escherichia
coli; Hinf, Haemophilus influenzae Rd; Hpyl, Helicobacter pylori; Mthe,
Methanobacterium thermoautotrophicum; Mjan, Methanococcus jannaschii;
Mtub, Mycobacterium tuberculosis; Mgen, Mycoplasma genitalium;
Mpne, Mycoplasma pneumoniae; Phor, Pyrococcus horikoshii;
Rpro, Rickettsia prowazekii; Scer, Saccharomyces cerevisiae; Syne, Synechocystis
sp.; Tpal, Treponema pallidum. The numbers in each intersection
cell indicate the number of times the fold pairs occur in a
genome. Only the 20 most common fold pair combinations are 
shown here; the remainder are shown on the Web site (http:// 
partslist.org/func). If a cell is greater than 6, it is shaded black; be- 
tween 3 and 6, gray; and below 3, white. The blank spaces show 
instances in which one of the pairs does not occur in the organism at 
all (indicated by a value of -1 in the data table on the Web site). The 
fold assignments are done in a fashion consistent with those in 
PartsList and associated systems (Gerstein 1997; Lin et al. 2000; Dra- 
wid et al. 2001; Harrison et al. 2001; Qian et al. 2001). 
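
The shading convention in the caption amounts to a small lookup. The sketch below is one reading of it, assuming that "between 3 and 6" is inclusive at both ends; the function name `cell_shade` is hypothetical.

```python
def cell_shade(count):
    """Map a fold-pair count to the shading described in the Figure 2 caption."""
    if count is None or count < 0:  # -1 marks a pair whose folds are absent
        return "blank"
    if count > 6:
        return "black"
    if count >= 3:                  # 3 to 6 inclusive (assumed)
        return "gray"
    return "white"
```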

tural classes compared to the single-domain matches. Alto- 
gether, there are more enzymes than non-enzymes among the 
multi-domain proteins (2805 enzymes vs 1958 non-enzymes) 
whereas for single-domain proteins, the opposite is true (850
enzymes vs. 968 non-enzymes).

Table 1 summarizes the distribution of superfamilies and
superfamily combinations among the major functional
classes, i.e., whether they have only enzymatic, only non-enzymatic,
or both enzymatic and non-enzymatic functionality.
Altogether, 215 superfamilies were found in single-domain
proteins and 310 in multi-domain ones. As 70 superfamilies
were found in both, altogether 455 distinct structural superfamilies
(215 + 310 - 70) matched a SWISS-PROT protein with our required
coverage criteria (described above). Similarly, we apportioned
the 281 superfamily combinations observed in multi-domain
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Figure 3 Distribution of proteins amongst broad structural and 
functional classes; the distribution of the matches among the seven 
structural and two functional classes in single- and multi-domain pro- 
teins. The single-domain and multi-domain matches each total 
100%, independently of each other. The horizontal axis indicates the
seven scop classes, which are (from 1 to 7): all-alpha, all-beta, alpha/ 
beta, alpha + beta, multi-domain, membrane, and small protein. 

to almost threefold (135 vs. 56). This agrees with the notion 
that most enzymes are multi-domain. Another difference be- 
tween single and multi-domain proteins appears in the ratio 
of superfamilies with a single function compared to multi-functional
ones. As is apparent from Table 1, about a quarter
of the superfamilies matched single-domain proteins with
different functions (55 of 215), whereas in the multi-domain 
proteins, this ratio increased to more than a third (119 of 310). 

Single-Domain Proteins 

Table 2 lists the two functionally most diverse structural su- 
perfamilies in single-domain proteins with some representa- 
tive functions. The most diverse superfamily, the 3.38.1 
Thioredoxin-like, has 11 different functions associated with 
it, most of them with an oxidoreductase mechanism. For in- 
stance, THIO_BPT4 is a small disulphide-containing thiore- 
doxin that serves as a general disulphide oxidoreductase, 



while TDX2_BRUMA is almost twice as long (199 aa) and 
serves as a thiol-specific antioxidant that acts against sulfur-containing
radicals. Another interesting example of functional
diversity is provided by the Scorpion toxin-like superfamily
(7.3.6). While BRAZ_PENBA is a small protein that is
known to be 2000 times sweeter than sucrose, the other members
of the superfamily are associated with different host-defense
mechanisms. In insects the superfamily possesses
antifungal activity (DMYC_DROME) or acts as a toxin
(SCX5_BUTEU). Interestingly, in plants it can also act as an
antifungal (AF2B_SINAL) or as an inhibitor of insect alpha-amylases
(SIA1_SORBI). It appears that many single-domain
proteins are toxins or allergens, or are related in other ways to 
a host-defense response. 

Based on the data we can also determine the probability
that two single-domain proteins that match domains in the
same superfamily category also carry out the same function,
using Bayes' theorem:

P(F|S) = P(F)P(S|F) / (P(F)P(S|F) + P(~F)P(S|~F))    (1)

where S is the event that two proteins share the same
superfamily, F is the event that two proteins have the
same function, and ~F is the event that two proteins do
not have the same function. Rearranging and simplifying the
equation we get:

P(F|S) = 1 / (1 + N(S,~F)/N(S,F))    (2)

where N is the number of times that the two events in the 
parentheses occur together in our database of 1818 single- 
domain proteins. This results in 

P(F|S) = 1/(1 + 8501/12516) = 68%. 

That is, the probability that two single-domain proteins that 
have the same superfamily structure have the same function 
(whether enzymatic or not) is about 2/3. 
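
Equation 2 follows from equation 1 because P(F)P(S|F) is the joint probability P(S,F), which is estimated by the pair count N(S,F) divided by the total number of pairs; the same holds for ~F, and the total cancels. The sketch below shows how such pair counts could be gathered and plugged into equation 2. It is an illustration only: the function names are hypothetical, and the toy data in the usage note are invented for the example and are not the counts used in the paper.

```python
from itertools import combinations


def count_pairs(proteins):
    """Count protein pairs that share a superfamily.

    `proteins` is an iterable of (superfamily, function) tuples.
    Returns (n_same, n_diff), i.e. N(S,F) and N(S,~F).
    """
    by_superfamily = {}
    for superfamily, function in proteins:
        by_superfamily.setdefault(superfamily, []).append(function)

    n_same = n_diff = 0
    for functions in by_superfamily.values():
        for f1, f2 in combinations(functions, 2):
            if f1 == f2:
                n_same += 1
            else:
                n_diff += 1
    return n_same, n_diff


def p_same_function(n_same, n_diff):
    """Equation 2: P(F|S) = 1 / (1 + N(S,~F) / N(S,F))."""
    if n_same == 0:
        return 0.0
    return 1.0 / (1.0 + n_diff / n_same)
```

For example, with the toy list `[("3.38.1", "A"), ("3.38.1", "A"), ("3.38.1", "B"), ("7.3.6", "C"), ("7.3.6", "C")]` there are two same-function pairs and two different-function pairs, so the estimate is 0.5.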

Multi-Domain Proteins

Table 3 lists the combinations of superfamilies that have been 
associated with the greatest number of different functions in 
multi-domain proteins, with representative entries in SWISS-PROT.
The combination with the greatest number of different
functions is that of 1.95.1 and 7.33.1. Although it has twice as 
many different functions as the most diverse superfamily in 



Table 1. Functional Distribution of Single-domain, Multi-domain Superfamilies, and
Multi-domain Combinations

                  Single-domain         Multi-domain          Multi-domain sfam
                  superfamilies         superfamilies         combinations
                  Single     Multiple   Single     Multiple   Single     Multiple
                  function   function   function   function   function   function

Enzymatic         82         11         135        42         151        16
Nonenzymatic      78         23         56         30         70         27
Both functions               15                    47                    17
Total             160        55         191        119        221        60


The basic functional distribution of the superfamilies in single- and multi-domain proteins and the 
functional distribution of multi-domain combinations are shown. The first row lists the number of
Table 2. Most Versatile Single-Domain Superfamilies

No.   No.   Sfam
func  prot  comb    Function       SWISS-PROT ID  SWISS-PROT function

11    69    3.38.1  E1.11.1        GSHP_RAT       Plasma Glutathione Peroxidase (1.11.1.9)
                    263#           DYL5_CHLRE     Dynein, Flagellar Outer Arm-C. reinhardtii
                    D260#          BSAA_BACSU     Glutathione Peroxidase Homolog Bsaa
                    268#           REHY_TORRU     Rehydrin-Tortula ruralis (Moss)
                    266#           PHOS_HUMAN     Phosducin (33 Kd Phototransducing Protein)
                    269#           REHY_ORYSA     Rad24 Protein-Oryza sativa (Rice)
                    272#           THIO_BPT4      Thioredoxin (Bacteriophage T4)
                    D271#272#      TDX2_BRUMA     Thioredoxin Peroxidase 2
                    261#           BTUE_ECOLI     Vitamin B12 Transport Periplasmic Protein Btue

10    28    7.3.6   342#           BRAZ_PENBA     Brazzein-Pentadiplandra brazzeana
                    376#336#       SCKK_TITSE     Neurotoxin Ts-Kapa (Tsk) (Brazilian scorpion)
                    341#356#       AF2B_SINAL     Cysteine-Rich Antifungal Protein 2b (Afp2b)
                    343#           DEFA_ZOPAT     Defensin, Isoforms B And C-Zophobas atratus
                    361#           DMYC_DROME     Drosomycin Precursor (Cysteine-Rich Peptide)
                    361#376#       SCX5_BUTEU     Insectotoxin I5a-(Lesser Asian scorpion)
                    336#           SCX3_LEIQH     Leiuropeptide MKScorpion)
                    203#           SIA1_SORBI     Small-Pr Inhibitor Of Insect Alpha-Amylases

7     34    4.79.3  310#           AB18_PEA       Aba-Responsive Protein Abr18-Garden Pea
                    311#           DRR3_PEA       Disease Resistance Response Protein Pi49
                    231#           MPAA_CORAV     Major Pollen Allergen Cor A 1-Eu. Hazel
                    312#           L18B_LUPLU     Protein Llr18b (Llpr10.1b)
                    E3.1.-         RNS2_PANGI     Ribonuclease 2 (3.1.-.-)-Panax Ginseng
                    314#           SAM2_SOYBN     Stress-Induced Protein Sam22

7     43    1.26.1  184#           CSF2_SHEEP     Colony-Stimulating Factor
                    381#564#184#   IL4_RAT        Interleukin-4 (B-Cell Igg Diff. Factor)
                    185#           LIF_HUMAN      Leukemia Inhibitory Factor (Lif)
                    187#           PRL_ANCAN      Prolactin Precursor (Prl)-
                    186#           PLF3_MOUSE     Proliferin 3 Mitogen-Regulated
                    188#           SOMA_PAROL     Somatotropin (Growth Hormone)



The most versatile superfamilies in single-domain proteins as determined from their functional description in SWISS- 
PROT, with some representatives. The keyword combinations in the fourth column were based either on the first three 
components of their EC numbers (for enzymes) or derived automatically by comparing the DE description line of 
SWISS-PROT entries to a list of synonymous keywords at http://bioinfo.mbb.yale.edu/partslist/func. A keyword num- 
ber starting with a D indicates an enzyme that does not have an assigned EC number in its description in SWISS-PROT. 



the single-domain proteins (22 vs. 11, respectively), careful 
examination reveals that all the proteins in this category are 
DNA-binding and most of them act as hormone receptors. 

The second entry listed in the table is the combination of 
the 3.4.1 and 4.48.1 superfamilies associated with the FAD/ 
NAD(P)-linked reductases. It is an all-enzymatic combination 
and always carries out an oxido-reductase function. All the
proteins in this category are completely covered by matches
with these two superfamilies. The 1.78.1-2.1.1 hemocyanin-immunoglobulin
combination seems also to be fairly conserved;
although the proteins in this category are called by
eight different names, most of them turn out to be extracellular
larval storage proteins, except for the copper-containing
oxygen carrier hemocyanin itself (HCY_PALVU).

Following the same logic, we can also determine the 
probability that two proteins that have the same superfamily 
combination share the same function, viz: 

P(F|S) = 1/(1 + 32242/1/54230) = 81%



almost complete coverage with exactly the same type and 
number of superfamilies, following each other in the same 
order. The probability that the functions are the same in this 
case was 91%, a considerably higher value than above. How- 
ever, if two multi-domain proteins share only a single super- 
family, the probability that they share the same function 
drops to only 35%! This greater functional certainty from
sharing a combination of superfamilies rather than just one is
also reflected in Table 1. While one-fourth of the single-domain
proteins and one-third of singularly matching superfamilies
in multi-domain proteins have multiple functions,
only about one-fifth of the multi-domain combinations possess
multiple functions (60 of 281). It is also clear from the
data that domains in larger proteins often lose their original 
function and no longer have an autonomous function. 
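
The three levels of match compared above (identical ordered architecture, same superfamily combination, and a single shared superfamily) can be told apart with simple tuple and set comparisons. The sketch below is illustrative only; the category labels and the function name `architecture_relation` are hypothetical, and the exact grouping procedure used for the probabilities is not spelled out in the text.

```python
def architecture_relation(arch_a, arch_b):
    """Classify how two domain architectures relate.

    Each architecture is an ordered sequence of scop superfamily identifiers,
    e.g. ("1.95.1", "7.33.1").
    """
    if tuple(arch_a) == tuple(arch_b):
        return "same order"                 # same type, number, and order
    set_a, set_b = set(arch_a), set(arch_b)
    if set_a == set_b:
        return "same combination"           # same superfamilies, other order or count
    shared = set_a & set_b
    if len(shared) == 1:
        return "single shared superfamily"
    if shared:
        return "partial overlap"
    return "unrelated"
```

Pairs in each category could then be counted separately and passed through equation 2 to obtain one probability per level of match.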

Seventy Common Superfamilies and Their 
Functions Compared in Single-Domain 
Table 3. Most Versatile Superfamily Combinations in Multi-Domain Proteins

No.   No.   Sfam
func  prot  comb.          Function      SWISS-PROT ID  SWISS-PROT function

22    176   1.95.1/7.33.1  29#           THB_RANCA      Thyroid Hormone Receptor Beta
                           10#           HNF4_DROME     Transcription Factor HNF-4 Homolog
                           31#32#        EAR2_MOUSE     V-Erba Related Protein Ear-2
                           29#30#        ECR_MANSE      Ecdysone Receptor (Ecdysteroid Receptor)
                           32#           ERBA_AVIER     Erba Oncogene Protein
                           556#564#35#   NGFI_XENLA     Nerve Growth Factor Induced Protein I-B
                           576#          NR42_HUMAN     Immediate-Early Response Protein Not
                           36#           PPAT_HUMAN     Peroxisome Proliferator Activated Receptor
                           37#           RXTG_CHICK     Retinoic Acid Receptor RXR-Gamma
                           38#           TLL_DROVI      Tailless Protein

      54    3.4.1/4.48.1   E1.8.2        DHSU_CHRVI     Sulfide Dehydrogenase (1.8.2.-)
                           E1.8.1        DLDH_ZYMMO     Dihydrolipoamide Dehydrogenase (1.8.1.4)
                           E1.6.4        TYTR_TRYCR     Trypanothione Reductase (1.6.4.8) (Tr)
                           E1.16.1       MERA_STRLI     Mercuric Reductase (1.16.1.1)
                           E1.6.99       NAOX_MYCPN     Probable NADH Oxidase (1.6.99.3) (Noxase)

8     23    1.78.1/2.1.1   19#           ARYB_MANSE     Arylphorin Beta Subunit-(Tobacco Hornworm)
                           20#           CRPI_PERAM     Allergen Cr-Pi Precursor-(American Cockroach)
                           21#427#       HCY_PALVU      Hemocyanin-(European Spiny Lobster)
                           22#           HEXA_BLADI     Hexamerin Precursor-(Tropical Cockroach)
                           23#           JSP1_TRINI     Acidic Juvenile Hormone-Suppressible Protein
                           24#           LSP2_DROME     Larval Serum Protein 2 Precursor (LSP-2)
                           546#25#       SSP1_BOMMO     Sex-Specific Storage-Protein 1



Note that the combination with the greatest number of different functions is that of 1.95.1 and 7.33.1. Careful 
examination reveals that all the proteins with this combination are DNA-binding and most of them act as various 
hormone receptors. In particular, HNF4_DROME and NR42_HUMAN also have transcription activator functions. Note 
that these two proteins are considerably longer than the others in this group and are not covered completely by 
structural matches: A large C-terminal and a large N-terminal portion are left uncovered, respectively. 



multi-domain proteins. These are listed in Table 4; 12 of them 
have enzymatic function, supporting the notion that en- 
zymes are more conserved during evolution than non- 
enzymes. The two non-enzymatic superfamilies are the 4.29.1 
ribosomal superfamily and the 5.4.1 superfamily in penicillin-binding
proteins.

Table 5 presents several examples of the converse situa- 
tion, shared superfamilies that have different functions in 
single and multi-domain proteins. Comparing parts A and B 
of the table highlights the fact that although both superfami- 



lies in a multi-domain protein are often present in single- 
domain form as well, the functions in the different settings 
are only vaguely related. One example is the combination of 
the lipocalin superfamily (2.45.1) with that of the BPTI-like or 
Kunitz inhibitor (7.7.1), which in higher organisms forms a 
complex protein called alpha-1-microglobulin (AMBP_RAT).
Another interesting example is the combination of the 2.5.1
Cupredoxin (occurring in the single-domain blue-copper protein,
SOXE_SULAC) and the 6.5.1 Membrane all-alpha
(single-domain representative: BACT_HALVA, a sensory rho-



Table 4. Superfamilies With the Same Function in Single- and Multi-Domain Proteins as Determined from Their Keyword
Combination or First Three Components of Their EC Numbers

                 Single-domain proteins                                     Multi-domain proteins
Sfam    Function SWISS-PROT ID  SWISS-PROT function                         SWISS-PROT ID  SWISS-PROT function

1.81.1  E3.2.1   GUNY_ERWCH     Endoglucanase (3.2.1.4)                     AMYG_NEUCR     Glucoamylase Precursor (3.2.1.3)
2.66.2  E3.5.1   URE2_YERPS     Urease Beta (3.5.1.5)                       URE1_HELPY     Urease Alpha Subunit (3.5.1.5)
3.17.2  E6.3.5   NADE_MYCPN     NAD(+) Synthetase (6.3.5.1)                 GUAA_YEAST     GMP Synthase (6.3.5.2)
3.37.1  E3.1.3   PTP2_NPVOP     Protein-Tyrosine Phosphatase 2 (3.1.3.48)   PTNB_RAT       Protein-Tyrosine Phosphatase (3.1.3.48)
3.67.1  E4.2.1   TRPB_VIBPA     Tryptophan Synthase (4.2.1.20)              TRP_YEAST      Tryptophan Synthase (4.2.1.20)
4.19.1  E5.2.1   FKB1_METJA     Peptidylprolyl Cis-Trans Isomerase (5.2.1.8) FKB7_WHEAT    70 Kd Peptidylprolyl Isomerase (5.2.1.8)
4.2.1   E3.2.1   LYCV_BPP2      Lysozyme (3.2.1.17)                         CHIX_PEA       Endochitinase Precursor (3.2.1.14)
4.29.1  85#      RS5_ACYKS      30s Ribosomal Protein S5                    RS5_TREPA      30s Ribosomal Protein S5

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Table 5. Examples of Superfamilies Present in Both Single- and Multi-Domain Proteins 
Carrying out Different Functions 



Table 5A. Single-Domain Proteins

Sfam     Funct #    SWISS-PROT ID  SWISS-PROT function

1.25.1   352#       FTN2_HAEIN     Ferritin-like Protein 2
         183#       NIGY_DESVH     Nigerythrin
         E1.17.4    RIR4_YEAST     (Ribonucleotide Reductase) (1.17.4.1)
         192#       NLP_HAEIN      Ner-like Protein Homolog
1.4.3    196#       H1A_PLADU      Histone H1A, Sperm
1.81.2   E2.5.1     PFTB_PEA       Farnesyltransferase Beta Su (2.5.1.-)
2.45.1   226#       ERBP_RAT       Epididymal Retinoic Acid-Binding Protein
         227#       FAB3_CAEEL     Fatty Acid-Binding Protein Homolog 3
         228#412#   NGAL_MOUSE     Neutrophil Gelatinase-Assoc. Lipocalin
         229#       NP4_RHOPR      Nitrophorin 4 Precursor
         E5.3.99    PGHD_HUMAN     Prostaglandin-H2 D-Isomerase (5.3.99.2)
         230#421#   VNS1_MOUSE     Vomeronasal Secretory Protein 1
2.5.1    231#       MPA3_AMBEL     Pollen Allergen AMB A 3 (AMB A III)
         212U27^    SOXE_SULAC     Sulfocyanin (Blue Copper Protein)
3.14.2   373#       RRF1_DESVH     Rrf1 Protein
3.29.1   E6.3.4     PURA_CAEEL     Adenylosuccinate Synthetase (6.3.4.4)
         E2.7.4     KTHY_YEAST     Thymidylate Kinase (2.7.4.9)
         D259#      VA57_VACCV     Guanylate Kinase Homolog
         E2.7.1     KITH_VZVW      Thymidine Kinase (2.7.1.21)
3.47.1   275#       MBL_BACSU      MBL Protein
         276#       MREB_BACSU     Rod Shape-determining Protein Mreb
3.48.1   E3.1.3     PPA5_YEAST     Repressible Acid Phosphatase (3.1.3.2)
3.81.1   D281#      AMIC_PSEAE     Aliphatic Amidase Expression-Regulator
         282#       LUXP_VIBHA     LUXP Protein Precursor
4.103.1  E2.4.2     TOX1_BORPE     Pertussis Toxin Su 1 (2.4.2.-)
4.105.1  291#       LECC_POLMI     Lectin-Polyandrocarpa Misakiensis
4.11.5   295#       TERP_PSESP     Terpredoxin
4.19.1   E5.2.1     FKB1_METJA     Pept-Prolyl Cis-Trans Isomerase (5.2.1.8)
6.5.1    E3.6.1     ATPL_VIBAL     ATP Synthase (3.6.1.34) (Lipid-binding)
         540#325#   BACT_HALVA     Sensory Rhodopsin II (Sr-II)
7.35.4   E1.9.3     COXB_RAT       Cytochrome C Oxidase (1.9.3.1) (Via*)
         345#       DESR_DESBI     Desulforedoxin (Dx)
7.7.1    349#       TAP_ORNMO      Tick Anticoagulant Peptide

(Table continues on following page.)



dopsin) superfamilies into a component of the respiratory 
chain, cytochrome C oxidase II (COX2_ZOOAN). All these
examples demonstrate the evolutionary advantage of a domain
fusion event, which creates a function that is more complex
than either of the components.



Gerstein 1999; Wilson et al. 2000). Figure 4 shows a similar 
graph with the calculations extended to multi-domain proteins.
The figure shows that the functional divergence of a
single domain in multi-domain proteins dramatically increases,
more than twofold, compared to the single-domain
Table 5B. Multi-Domain Proteins

Sfam Comb.     Funct #    SWISS-PROT ID  SWISS-PROT function

1.25.1/7.35.4  104#       RUBY_METJA     Putative Rubrerythrin
[illegible]    11#        PURR_HAEIN     Purine Nucleotide Synthesis Repressor
               12#        DEGA_BACSU     Degradation Activator
               581#11#    SCRR_STRMU     Sucrose Operon Repressor
               582#11#    REGA_CLOAB     Transcription Regulatory Protein Rega
1.4.3/3.14.2   10#        SKN7_YEAST     Transcription Factor Skn7 (Pos9 Protein)
               11#        VIRG_AGRT5     Virg Regulatory Protein
               13#        RGX3_MYCTU     Sensory Transduction Protein REGX3
               190#       PFER_PSEAE     Transcriptional Activator Protein Pfer
               366#       PETR_RHOCA     Petr Protein
2.45.1/7.7.1   203#153#   HC_RAT         Alpha-1-Microglobulin/Trypsin Inhibitor
2.5.1/6.5.1    E1.9.3     COX2_ZOOAN     Cytochrome C Oxidase II (1.9.3.1)
3.29.1/3.48.1  E2.7.1     F26_RANCA      6-Phosphofructo-2-Kinase (2.7.1.105)
3.47.1/5.17.1  1#         YEDO_YEAST     Heat Shock Protein 70 Homolog YEL030w
               1#83#      GR73_MAIZE     Ig-Binding Protein



DISCUSSION 

Here we built on our previous studies on the relationship 
between protein structure and function to develop new re- 
sults related to multi-domain proteins. Throughout the paper, 
we focused on superfamilies instead of folds, as the members 
of a superfamily are presumably of common evolutionary ori- 
gin (Murzin et al. 1995). 

We found that the 4763 multi-domain and 1818 single- 
domain proteins that met our selection criteria have about 
the same distribution of structural classes, with more enzy- 
matic functions associated with the alpha/beta structural 
classes and more non-enzymatic ones with the all-alpha and 
small classes. We identified more than three times as many 
multi-domain proteins that were enzymes than single- 
domain ones (2805 and 850, respectively) and, conversely, 
about twice as many multi-domain proteins as single-domain 
ones that were non-enzymes (1958 vs. 968). 

We focused on the functional divergence of the two 
groups and found that about a quarter of the superfamilies in 
single-domain proteins are associated with multiple functions,
whereas only about a fifth of the multi-domain superfamily
combinations are. Therefore, we can conclude that a
combination of specific superfamilies results in a more specific
functional assignment for a particular protein. However,
about one-third of the superfamilies in the multi-domain pro- 
teins were associated with multiple functions, underlining 
the lesser autonomy of a domain function in multi-domain 
protein. 

This latter finding was also supported by the difference 
in functional divergences between the two groups of proteins 
based on particular sequence similarities between the domains
and SWISS-PROT proteins. As is shown in Figure 4, the
average functional divergence of a single domain is much
larger (more than twofold) in multi-domain proteins. This



was rather surprising to us, and should be taken into consid- 
eration in functional characterization and annotation of new 
gene products. When the functions were related in single- and 
multi-domain proteins, we could observe an increasing func- 
tional complexity with the appearance of large multi-domain 
proteins. 

Altogether, with the recent sequencing of the human 
genome and the genomes of other model organisms, we hope 



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that this work can contribute to the successful annotation of 
the individual gene products, and will help to avoid some 
pitfalls associated with the functional characterization of
large, complex proteins. 

The publication costs of this article were defrayed in part
by payment of page charges. This article must therefore be 
hereby marked "advertisement" in accordance with 18 USC 
section 1734 solely to indicate this fact. 
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